ON THE INFINITE SWAPPING LIMIT FOR PARALLEL TEMPERING

Abstract. Parallel tempering, also known as replica exchange sampling, is an important method 
for simulating complex systems. In this algorithm simulations are conducted in parallel at a series of 
temperatures, and the key feature of the algorithm is a swap mechanism that exchanges configurations 
between the parallel simulations at a given rate. The mechanism is designed to allow the low 
temperature system of interest to escape from deep local energy minima where it might otherwise 
be trapped, via those swaps with the higher temperature components. In this paper we introduce 
a performance criterion for such schemes based on large deviation theory, and argue that the rate of 
convergence is a monotone increasing function of the swap rate. This motivates the study of the 
limit process as the swap rate goes to infinity. We construct a scheme which is equivalent to this 
limit in a distributional sense, but which involves no swapping at all. Instead, the effect of the 
swapping is captured by a collection of weights that influence both the dynamics and the empirical 
measure. While theoretically optimal, this limit is not computationally feasible when the number 
of temperatures is large, and so variations that are easy to implement and nearly optimal are also 
developed. 
martingale, random measure 

AMS subject classifications. 60J25, 60J75, 60F10, 28D20, 60A10, 60G42, 60G57 

1. Introduction. The problem of computing integrals with respect to Gibbs 
measures occurs in chemistry, physics, statistics, engineering and elsewhere. In many 
situations, there are no viable alternatives to methods based on Monte Carlo. Given 
an energy potential, there are standard methods to construct a Markov process whose 
unique invariant distribution is the associated Gibbs measure, and an approximation 
is given by the occupation or empirical measure of the process over some finite time 
interval [10]. However, a weakness of these methods is that they may be slow to 
converge. This happens when the dynamics of the process do not allow all important 
parts of the state space to communicate easily with each other. In large scale applica- 
tions this occurs frequently, since the potential function often has complex structures 
involving multiple deep local minima. 

An interesting method called "parallel tempering" has been designed to overcome 
some of the difficulties associated with rare transitions [H [6] [21] [22]. In this technique, 
simulations are conducted in parallel at a series of temperatures. This method does 
not require detailed knowledge of or intricate constructions related to the energy 
surface and is a standard method for simulating complex systems. To illustrate the 
main idea, we first discuss the diffusion case with two temperatures. Discrete time 
models will be considered later in the paper, and there are obvious analogues for 
discrete state systems. 

Suppose that the probability measure of interest is $\mu(dx) \propto e^{-V(x)/\tau_1}dx$, where 
$\tau_1$ is the temperature and $V : \mathbb{R}^d \to \mathbb{R}$ is the potential function. The normalization 
constant of this distribution is typically unknown. Under suitable conditions on $V$, 
$\mu$ is the unique invariant distribution of the solution to the stochastic differential 
equation 

$$dX = -\nabla V(X)\,dt + \sqrt{2\tau_1}\,dW,$$

where $W$ is a $d$-dimensional standard Wiener process. A straightforward Monte Carlo 
approximation to $\mu$ is the empirical measure over a large time interval of length $T$, 
namely, 

$$\frac{1}{T}\int_B^{T+B} \delta_{X(t)}(dx)\,dt,$$

where $\delta_x$ is the Dirac measure at $x$ and $B > 0$ denotes a "burn-in" period. When $V$ 
has multiple deep local minima and the temperature $\tau_1$ is small, the diffusion $X$ can 
be trapped within these deep local minima for a long time before moving out to other 
parts of the state space. This is the main cause of the inefficiency. 

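Now consider a second, larger temperature $\tau_2$. If $W_1$ and $W_2$ are independent 
Wiener processes, then of course the empirical measure of the pair 

$$\begin{aligned} dX_1 &= -\nabla V(X_1)\,dt + \sqrt{2\tau_1}\,dW_1 \\ dX_2 &= -\nabla V(X_2)\,dt + \sqrt{2\tau_2}\,dW_2 \end{aligned} \tag{1.1}$$

gives an approximation to the Gibbs measure with density 

$$\pi(x_1, x_2) \propto e^{-V(x_1)/\tau_1}\, e^{-V(x_2)/\tau_2}. \tag{1.2}$$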

The idea of parallel tempering is to allow "swaps" between the components $X_1$ and $X_2$. 
In other words, at random times the $X_1$ component is moved to the current location of 
the $X_2$ component, and vice versa. Swapping is done according to a state dependent 
intensity, and so the resulting process is actually a Markov jump diffusion. The form of 
the jump intensity can be selected so that the invariant distribution remains the same, 
and thus the empirical measure of $X_1$ can still be used to approximate $\mu$. Specifically, 
the jump intensity or swapping intensity is of the Metropolis form $a\,g(X_1, X_2)$, where 

$$g(x_1, x_2) = 1 \wedge \frac{\pi(x_2, x_1)}{\pi(x_1, x_2)} \tag{1.3}$$

and $a \in (0, \infty)$ is a constant. Note that the calculation of $g$ does not require 
knowledge of the normalization constant. A straightforward calculation shows that $\pi$ 
is the stationary density for the resulting process for all values of $a$ [see (2.2)]. We 
refer to $a$ as the "swap rate," and note that as $a$ increases, the swaps become more 
frequent. 
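
As a minimal sketch of how the swap intensity (1.3) can be evaluated without the normalization constant, the following Python fragment computes $g$ from the potential alone. The double-well potential and all function names are illustrative assumptions and do not come from the paper.

```python
import math

def double_well(x):
    """Illustrative one-dimensional potential (an assumption for this sketch)."""
    return (x * x - 1.0) ** 2

def log_pi_unnormalized(V, x1, x2, tau1, tau2):
    """Log of the product Gibbs density (1.2), up to the unknown additive constant."""
    return -V(x1) / tau1 - V(x2) / tau2

def swap_intensity(V, x1, x2, tau1, tau2):
    """g(x1, x2) = min(1, pi(x2, x1) / pi(x1, x2)); the constant cancels in the ratio."""
    log_ratio = (log_pi_unnormalized(V, x2, x1, tau1, tau2)
                 - log_pi_unnormalized(V, x1, x2, tau1, tau2))
    return math.exp(min(0.0, log_ratio))
```

Working with the logarithm of the ratio avoids overflow when the energies differ greatly. With this form, the detailed balance identity $\pi(x_1,x_2)\,g(x_1,x_2) = \pi(x_2,x_1)\,g(x_2,x_1)$ can be checked numerically at any pair of points.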

The intuition behind parallel tempering is that the higher temperature compo- 
nent, being driven by a Wiener process with greater volatility, will move more easily 
between the different parts of the state space. This "ease-of-movement" is transferred 
to the lower temperature component via the swapping mechanism so that it is less 
likely to be trapped in the deep local minima of the energy potential. This, in turn, is 
expected to lead to more rapid convergence of the empirical measure to the invariant 
distribution of the low temperature component. There is an obvious extension to 
more than two temperatures. 

Although this procedure is remarkably simple and needs little detailed informa- 
tion for implementation, relatively little is known regarding theoretical properties. 
A number of papers discuss the efficiency and optimal design of parallel tempering 
[8], [16], [15]. However, most of these discussions are based on heuristics and empirical 
evidence. In general, some care is required to construct schemes that are effective. 
For example, it can happen that for a given energy potential function and swapping 
rate, the probability for swapping may be so low that it does not significantly improve 
performance. 

There are two aims to the current paper. The first is to introduce a performance 
criterion for Monte Carlo schemes of this general kind that differs in some interesting 
ways from traditional criteria, such as the magnitude of the sub-dominant eigenvalue 
of a related operator [11], [25]. More precisely, we use the theory of large deviations to 
define a "rate of convergence" for the empirical measure. The key observation here 
is that this rate, and hence the performance of parallel tempering, is monotonically 
increasing with respect to the swap rate a. Traditional wisdom in the application 
of parallel tempering has been that one should not attempt to swap too frequently. 
While an obvious reason is that the computational cost for swapping attempts might 
become a burden, it was also argued that frequent swapping would result in poor 
sampling. For a discussion of prior approaches to the question of how to set the 
swapping rate and an argument in favor of frequent swapping, see [20], [19]. 

The use of this large deviation criterion and the resulting monotonicity with respect 
to $a$ directly suggest the second aim, which is to study parallel tempering in the 
limit as $a \to \infty$. Note that the computational cost due just to the swapping will 
increase without bound, even on bounded time intervals, when $a \to \infty$. However, 
we will construct an alternative scheme, which uses different process dynamics and a 
weighted empirical measure. Because this process no longer swaps particle positions, 
it and the weighted empirical measure have a well-defined limit as $a \to \infty$, which we 
call infinite swapping. In effect, the swapping is achieved through the proper choice 
of weights and state dependent diffusion coefficients. This is done for the case of both 
continuous and discrete time processes with multiple temperatures. 

An outline of the paper is as follows. In the next section the swapping model in 
continuous time is introduced and the rate of convergence, as measured by a large 
deviations rate function, is defined. The alternative scheme which is equivalent to 
swapping but which has a well defined limit is introduced, and its limit as $a \to \infty$ is 
identified. The following section considers the analogous limit model for more than 
two temperatures, and discusses certain practical difficulties associated with direct 
implementation when the number of temperatures is not small. The continuous time 
model is used for illustration because both the large deviation rate and the weak limit 
of the appropriately redefined swapping model take a simpler form than those of dis- 
crete time models. However, the discrete time model is what is actually implemented 
in practice. To bridge the gap between continuous time diffusion models and discrete 
time models, in Section 4 we discuss the idea of infinite swapping for continuous time 
Markov jump processes and prove that the key properties demonstrated for diffusion 
models hold here as well. We also state a uniform (in the swapping parameter) large 
deviation principle. The discrete time form actually used in numerical implementation 
is presented in Section 5. Section 6 returns to the issue of implementation when 
the number of temperatures is not small. In particular, we resolve the difficulty of 
direct implementation of the infinite swapping models via approximation by what we 
call partial infinite swapping models. Section 7 gives numerical examples, and an 
appendix gives the proof of the uniform large deviation principle. 

2. Diffusion models with two temperatures. Although the implementation 
of parallel tempering uses a discrete time model, the motivation for the infinite swap- 
ping limit is best illustrated in the setting where the state process is a continuous time 
diffusion process. It is in this case that the large deviation rate function, as well as the 
construction of a process that is distributionally equivalent to the infinite swapping 
limit, is simplest. In order to minimize the notational overhead, we discuss in detail 
the two temperature case. The extension to models with multiple temperatures is 
obvious and will be stated in the next section. 

2.1. Model setup. Let $(X_1^a, X_2^a)$ denote the Markov jump diffusion process of 
parallel tempering with swap rate $a$. That is, between swaps (or jumps), the process 
follows the diffusion dynamics (1.1). Jumps occur according to the state dependent 
intensity function $a\,g(X_1^a, X_2^a)$. At a jump time $t$, the particles swap locations, that is, 
$(X_1^a(t), X_2^a(t)) = (X_2^a(t-), X_1^a(t-))$. Hence for a smooth function $f : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ 
the infinitesimal generator of the process is given by 

$$\begin{aligned}
\mathcal{L}^a f(x_1,x_2) = {} & -\langle \nabla_{x_1} f(x_1,x_2), \nabla V(x_1)\rangle - \langle \nabla_{x_2} f(x_1,x_2), \nabla V(x_2)\rangle \\
& + \tau_1 \operatorname{tr}\left[\nabla^2_{x_1x_1} f(x_1,x_2)\right] + \tau_2 \operatorname{tr}\left[\nabla^2_{x_2x_2} f(x_1,x_2)\right] \\
& + a\,g(x_1,x_2)\left[f(x_2,x_1) - f(x_1,x_2)\right],
\end{aligned}$$

where $\nabla_{x_i} f$ and $\nabla^2_{x_ix_i} f$ denote the gradient and the Hessian matrix with respect 
to $x_i$, respectively, and tr denotes trace. Throughout the paper we also assume the 
growth condition 

$$\lim_{r \to \infty}\ \inf_{x : |x| \ge r} \langle \nabla V(x), x/|x| \rangle = \infty. \tag{2.1}$$

This condition not only ensures the existence and uniqueness of the invariant dis- 
tribution, but also enforces the exponential tightness needed for the large deviation 
principle for the empirical measures. 

Recall the definition of $\pi$ in (1.2) and let $\mu$ be the corresponding Gibbs probability 
distribution, that is, 

$$\mu(dx_1dx_2) = \pi(x_1,x_2)\,dx_1dx_2 \propto e^{-V(x_1)/\tau_1}\, e^{-V(x_2)/\tau_2}\,dx_1dx_2.$$

Straightforward calculations show that for any smooth function $f$ which vanishes at 
infinity 

$$\int \mathcal{L}^a f(x_1,x_2)\,\mu(dx_1dx_2) = 0. \tag{2.2}$$

Since the condition (2.1) implies that $V(x) \to \infty$ as $|x| \to \infty$, by Echeverria's 
Theorem [5, Theorem 4.9.17], $\mu$ is the unique invariant probability distribution of the 
process $(X_1^a, X_2^a)$. 
2.2. Rate of convergence by large deviations. It follows from the previous 
discussion and the ergodic theorem [1] that, for a fixed burn-in time $B$, with 
probability one 

$$\lambda_T^a \doteq \frac{1}{T}\int_B^{T+B} \delta_{(X_1^a(t), X_2^a(t))}\,dt \Rightarrow \mu$$

as $T \to \infty$. For notational simplicity we assume without loss of generality that $B = 0$ 
from now on. A basic question of interest is how rapid this convergence is, and how 
it depends on the swap rate $a$. In particular, what is the rate of convergence of 
the lower temperature marginal? 

We note that standard measures one might use for the rate of convergence, such as 
the second eigenvalue of the associated operator, are not necessarily appropriate here. 
They only provide indirect information on the convergence properties of the empirical 
measure, which is the quantity of interest in the Monte Carlo approximation. Such 
measures properly characterize the convergence rate of the transition probability 



$$p(x, dy; t) = P\left\{(X_1^a(t), X_2^a(t)) \in dy \mid (X_1^a(0), X_2^a(0)) = x\right\}, \quad x, y \in \mathbb{R}^d \times \mathbb{R}^d,$$

as $t \to \infty$. However, they neglect the time averaging effect of the empirical 
measure, an effect that is not present with the transition probability. In fact, it is easy 
to construct examples such as nearly periodic Markov chains for which the second 
eigenvalue suggests a slow convergence when in fact the empirical measure converges 
quickly [17]. 

Another commonly used criterion for algorithm performance is the notion of 
asymptotic variance [10] [HI [23]. For a given functional $f : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, one 
can establish a central limit theorem which asserts that as $T \to \infty$ 

$$\sqrt{T}\left[\frac{1}{T}\int_0^T f(X_1^a(t), X_2^a(t))\,dt - \int f\,d\mu\right] \Rightarrow N(0, \sigma^2).$$

The magnitude of $\sigma^2$ is used to measure the statistical efficiency of the algorithm. 
The asymptotic variance is closely related to the spectral properties of the underlying 
probability transition kernel [7] [T^. However, as with the second eigenvalue, the 
usefulness of this criterion for evaluating performance of the empirical measure is 
not clear. 

In this paper, we use the large deviation rate function to characterize the rate 
of convergence of a sequence of random probability measures. To be more precise, 
let $S$ be a Polish space, that is, a complete and separable metric space. Denote by 
$\mathcal{P}(S)$ the space of all probability measures on $S$. We equip $\mathcal{P}(S)$ with the topology of 
weak convergence, though one can often use the stronger $\tau$-topology [3]. Under the 
weak topology, $\mathcal{P}(S)$ is metrizable and itself a Polish space. Note that the empirical 
measure is a random probability measure, that is, a random variable taking values 
in the space $\mathcal{P}(S)$. 

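Definition 2.1. A sequence of random probability measures $\{\gamma_T\}$ is said to 
satisfy a large deviation principle (LDP) with rate function $I : \mathcal{P}(S) \to [0, \infty]$, if for 
all open sets $O \subset \mathcal{P}(S)$ 

$$\liminf_{T \to \infty} \frac{1}{T} \log P\{\gamma_T \in O\} \ge -\inf_{\nu \in O} I(\nu),$$

for all closed sets $F \subset \mathcal{P}(S)$ 

$$\limsup_{T \to \infty} \frac{1}{T} \log P\{\gamma_T \in F\} \le -\inf_{\nu \in F} I(\nu),$$

and if $\{\nu : I(\nu) \le M\}$ is compact in $\mathcal{P}(S)$ for all $M < \infty$. 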

For our problem all rate functions encountered will vanish only at the unique 
invariant distribution $\mu$, and hence give information on the rate of convergence of 
the empirical measure $\lambda_T^a$. A larger rate function will indicate faster convergence, 
though this is only an asymptotic statement valid for sufficiently large $T$. 

2.3. Explicit form of rate function. The large deviation theory for the empirical 
measure of a Markov process was first studied in [2]. Besides the Feller property, 
which will hold for all processes we consider, the validity of the LDP depends on two 
types of conditions. One is a so-called transitivity condition, which requires that there 
are times $T_1$ and $T_2$ such that for any $x, y \in \mathbb{R}^d \times \mathbb{R}^d$, 

$$\int_0^{T_1} e^{-t}\, p(x, dz; t)\,dt \ll \int_0^{T_2} e^{-t}\, p(y, dz; t)\,dt,$$

where $\ll$ indicates that the measure in $z$ on the left is absolutely continuous with 
respect to the measure on the right. For the jump diffusion process we consider 
here, this condition holds automatically since $\nabla V$ is bounded on bounded sets, $g$ is 
bounded, and the diffusion coefficients are uniformly non-degenerate. The second 
type of condition is one that enforces a strong form of tightness, such as (2.1). 

Under condition (2.1), the LDP holds for the empirical measures $\{\lambda_T^a : T > 0\}$ and the 
rate function, denoted by $I^a$, takes a fairly explicit form because the process is in 
continuous time and reversible [H [13]. We state the following result and omit the largely 
straightforward calculation since its role here is motivational. [A uniform LDP for the analogous 
jump Markov process will be stated in Section 4, and its proof is given in the appendix.] 

Let $\nu$ be a probability measure on $\mathbb{R}^d \times \mathbb{R}^d$ with smooth density. Define 
$\theta(x_1,x_2) = [d\nu/d\mu](x_1,x_2)$. Then $I^a(\nu)$ can be expressed as 

$$I^a(\nu) = J_0(\nu) + a\,J_1(\nu), \tag{2.3}$$

where 

$$J_0(\nu) = \int_{\mathbb{R}^d \times \mathbb{R}^d} \frac{1}{4\,\theta(x_1,x_2)^2}\left[\tau_1 \left\|\nabla_{x_1}\theta(x_1,x_2)\right\|^2 + \tau_2 \left\|\nabla_{x_2}\theta(x_1,x_2)\right\|^2\right] \nu(dx_1dx_2),$$

$$J_1(\nu) = \int_{\mathbb{R}^d \times \mathbb{R}^d} g(x_1,x_2)\,\ell\!\left(\sqrt{\frac{\theta(x_2,x_1)}{\theta(x_1,x_2)}}\right) \nu(dx_1dx_2),$$

and where $\ell(z) = z \log z - z + 1$ for $z \ge 0$ is familiar from the large deviation theory 
for jump processes. 

The key observation is that the rate function $I^a(\nu)$ is affine in the swapping rate $a$, 
with $J_0(\nu)$ the rate function in the case of no swapping. Furthermore, $J_1(\nu) \ge 0$ with 
equality if and only if $\theta(x_2,x_1) = \theta(x_1,x_2)$ for $\nu$-a.e. $(x_1,x_2)$. This form of the rate 
function, and in particular its monotonicity in $a$, motivates the study of the infinite 
swapping limit as $a \to \infty$. 

Remark 2.2. The limit of the rate function $I^a$ satisfies 

$$I^\infty(\nu) \doteq \lim_{a \to \infty} I^a(\nu) = \begin{cases} J_0(\nu) & \text{if } \theta(x_1,x_2) = \theta(x_2,x_1)\ \ \nu\text{-a.s.}, \\ \infty & \text{otherwise.} \end{cases}$$

Fig. 2.1. Temperature swapped and particle swapped processes 

Hence for $I^\infty(\nu)$ to be finite it is necessary that $\nu$ put exactly the same relative weight 
as $\mu$ on the points $(x_1,x_2)$ and $(x_2,x_1)$. Note that if a process could be constructed 
with $I^\infty$ as its rate function, then with the large deviation rate as our criterion such 
a process improves on parallel tempering with finite swapping rate in exactly those 
situations where parallel tempering improves on the process with no swapping at all. 

2.4. Infinite swapping limit. From a practical perspective, it may appear that 
there are limitations on how much benefit one obtains by letting $a \to \infty$. When 
implemented in discrete time, the overall jump intensity corresponds to the generation 
of roughly $a$ independent random variables that are uniform on $[0, 1]$ for each 
corresponding unit of continuous time, and based on each uniform variable a comparison 
is made to decide whether or not to make the swap. Hence even for fixed and finite 
$T$, the computations required to simulate a single trajectory scale like $a$ as $a \to \infty$. 
Thus it is of interest if one can gain the benefit of the higher swapping rate without 
all the computational burden. This turns out to be possible, but requires that we 
view the prelimit processes in a different way. 

It is clear that the processes $(X_1^a, X_2^a)$ are not tight as $a \to \infty$, since the number 
of discontinuities of size $O(1)$ will grow without bound in any time interval of positive 
length. In order to obtain a limit, we consider alternative processes defined by 

$$\begin{aligned} dY_1^a &= -\nabla V(Y_1^a)\,dt + \sqrt{2\tau_1 1_{\{Z^a = 0\}} + 2\tau_2 1_{\{Z^a = 1\}}}\,dW_1 \\ dY_2^a &= -\nabla V(Y_2^a)\,dt + \sqrt{2\tau_2 1_{\{Z^a = 0\}} + 2\tau_1 1_{\{Z^a = 1\}}}\,dW_2 \end{aligned} \tag{2.4}$$

where $Z^a$ is a jump process that switches from state 0 to state 1 with intensity 
$a\,g(Y_1^a, Y_2^a)$ and from state 1 to state 0 with intensity $a\,g(Y_2^a, Y_1^a)$. Compared to 
conventional parallel tempering, the processes $(Y_1^a, Y_2^a)$ swap the diffusion coefficients at 
the jump times rather than the physical locations of two particles with constant 
diffusion coefficients. For this reason, we refer to the solution to (2.4) as the temperature 
swapped process, in order to distinguish it from the particle swapped process $(X_1^a, X_2^a)$. 
We illustrate these processes in Figure 2.1. Note that the solid line and the dotted 
line represent $X_1^a$ and $X_2^a$, respectively. These processes have more and more frequent 
jumps of size $O(1)$ as $a \to \infty$. In contrast, the processes $(Y_1^a, Y_2^a)$ have varying diffusion 
coefficients. The figure attempts to also suggest features of the discrete time setting, 
with both successful and failed swap attempts. 

Clearly the empirical measure of $(Y_1^a, Y_2^a)$ does not provide an approximation to 
$\mu$. Instead, we should shift attention between $(Y_1^a, Y_2^a)$ and $(Y_2^a, Y_1^a)$ depending on 
the value of $Z^a$. Indeed, the random probability measures 

$$\eta_T^a = \frac{1}{T}\int_0^T \left[1_{\{Z^a(t) = 0\}}\,\delta_{(Y_1^a(t), Y_2^a(t))} + 1_{\{Z^a(t) = 1\}}\,\delta_{(Y_2^a(t), Y_1^a(t))}\right] dt \tag{2.5}$$

have the same distribution as 

$$\frac{1}{T}\int_0^T \delta_{(X_1^a(s), X_2^a(s))}\,ds,$$

and hence converge to $\mu$ at the same rate. However, these processes and measures have 
well defined limits in distribution as $a \to \infty$. More precisely, we have the following 
result. For the proof see [9]. Related (but more complex) calculations are needed to 
prove the uniform large deviation result given in Theorem 4.1. 

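Theorem 2.3. Assume that $\nabla V$ is locally Lipschitz continuous. Then for each 
$T$ the sequence $(Y_1^a, Y_2^a, \eta_T^a)$ converges in distribution to $(Y_1^\infty, Y_2^\infty, \eta_T^\infty)$ as $a \to \infty$, 
where $(Y_1^\infty, Y_2^\infty)$ is the unique strong solution to 

$$\begin{aligned} dY_1^\infty &= -\nabla V(Y_1^\infty)\,dt + \sqrt{2\tau_1\rho(Y_1^\infty, Y_2^\infty) + 2\tau_2\rho(Y_2^\infty, Y_1^\infty)}\,dW_1 \\ dY_2^\infty &= -\nabla V(Y_2^\infty)\,dt + \sqrt{2\tau_2\rho(Y_1^\infty, Y_2^\infty) + 2\tau_1\rho(Y_2^\infty, Y_1^\infty)}\,dW_2, \end{aligned} \tag{2.6}$$

$$\eta_T^\infty = \frac{1}{T}\int_0^T \left[\rho(Y_1^\infty(t), Y_2^\infty(t))\,\delta_{(Y_1^\infty(t), Y_2^\infty(t))} + \rho(Y_2^\infty(t), Y_1^\infty(t))\,\delta_{(Y_2^\infty(t), Y_1^\infty(t))}\right] dt, \tag{2.7}$$

and 

$$\rho(x_1,x_2) = \frac{\pi(x_1,x_2)}{\pi(x_2,x_1) + \pi(x_1,x_2)}.$$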



The existence and form of the limit are due to the time scale separation between 
the fast $Z^a$ process and the slow $(Y_1^a, Y_2^a)$ process. To give an intuitive explanation of 
the limit dynamics, consider the prelimit processes (2.4). Suppose that on a small time 
interval, the value of the slow process $(Y_1^a, Y_2^a)$ does not vary much, say $(Y_1^a, Y_2^a) \approx 
(x_1,x_2)$. Given the dynamics of the binary process $Z^a$, it is easy to verify that as 
$a$ tends to infinity the fractions of time that $Z^a = 0$ and $Z^a = 1$ are $\rho(x_1,x_2)$ and 
$\rho(x_2,x_1)$, respectively. This leads to the limit dynamics (2.6). When mapped back to 
the particle swapped process, $\rho(x_1,x_2)$ and $\rho(x_2,x_1)$ account for the fraction of time 
that $(X_1^a, X_2^a) = (x_1,x_2)$ and $(X_1^a, X_2^a) = (x_2,x_1)$, respectively, which naturally leads 
to the limit weighted empirical measure (2.7). 

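The weights $\rho(x_1,x_2)$ and $\rho(x_2,x_1)$ do not depend on the unknown normalization 
constant, and in fact 

$$\rho(x_1,x_2) = \frac{e^{-\frac{V(x_1)}{\tau_1} - \frac{V(x_2)}{\tau_2}}}{e^{-\frac{V(x_1)}{\tau_1} - \frac{V(x_2)}{\tau_2}} + e^{-\frac{V(x_2)}{\tau_1} - \frac{V(x_1)}{\tau_2}}} \tag{2.8}$$

and 

$$\rho(x_2,x_1) = 1 - \rho(x_1,x_2) = \frac{e^{-\frac{V(x_2)}{\tau_1} - \frac{V(x_1)}{\tau_2}}}{e^{-\frac{V(x_1)}{\tau_1} - \frac{V(x_2)}{\tau_2}} + e^{-\frac{V(x_2)}{\tau_1} - \frac{V(x_1)}{\tau_2}}}.$$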
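
The time scale separation argument can be checked numerically. The sketch below, in Python with hypothetical function names, evaluates the weight $\rho$ of (2.8) in a numerically stable way and compares it with the stationary probability of the state $Z = 0$ for a two-state chain that jumps $0 \to 1$ at rate $a\,g(x_1,x_2)$ and $1 \to 0$ at rate $a\,g(x_2,x_1)$ with $(x_1,x_2)$ frozen. The two quantities should agree for every pair of points.

```python
import math

def rho(V, x1, x2, tau1, tau2):
    """Weight rho(x1, x2) = pi(x1, x2) / (pi(x1, x2) + pi(x2, x1)), as in (2.8)."""
    log_a = -V(x1) / tau1 - V(x2) / tau2   # log pi(x1, x2) up to a constant
    log_b = -V(x2) / tau1 - V(x1) / tau2   # log pi(x2, x1) up to the same constant
    m = max(log_a, log_b)                  # shift so the larger exponent is zero
    ea, eb = math.exp(log_a - m), math.exp(log_b - m)
    return ea / (ea + eb)

def g(V, x1, x2, tau1, tau2):
    """Metropolis swap intensity (1.3), computed from the potential alone."""
    log_ratio = (V(x1) - V(x2)) * (1.0 / tau1 - 1.0 / tau2)
    return math.exp(min(0.0, log_ratio))

def fraction_of_time_in_state_0(V, x1, x2, tau1, tau2):
    """Stationary probability of Z = 0 with (x1, x2) held fixed.

    The rates are a*g(x1, x2) for 0 -> 1 and a*g(x2, x1) for 1 -> 0,
    so the swap rate a cancels in the ratio.
    """
    up = g(V, x1, x2, tau1, tau2)
    down = g(V, x2, x1, tau1, tau2)
    return down / (up + down)
```

Since one of $g(x_1,x_2)$ and $g(x_2,x_1)$ always equals one, the denominator in the last function never vanishes.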



The following properties of the limit system are worth noting. 
1. Instantaneous equilibration of multiple locations. Observe that 
the lower temperature component of this modified "empirical measure," i.e., 
the first marginal, uses contributions from both components at all times, 
corrected according to the weights. The form of the weights in (2.7) guarantees 
that the contributions to $\eta_T^\infty$ from locations $(Y_1^\infty, Y_2^\infty)$ and $(Y_2^\infty, Y_1^\infty)$ are at 
any time perfectly balanced according to the invariant distribution on product 
space. 

2. Symmetry and invariant distribution. While the marginals of $\eta_T^\infty$ play 
very different roles, the dynamics of $Y_1^\infty$ and $Y_2^\infty$ are actually symmetric. 
Using Echeverria's Theorem [5, Theorem 4.9.17], it can be shown that 
the unique invariant distribution of the process $(Y_1^\infty, Y_2^\infty)$ has the density 

$$\frac{1}{2}\left[\pi(x_1,x_2) + \pi(x_2,x_1)\right].$$

It then follows from the ergodic theorem that $\eta_T^\infty \Rightarrow \mu$ w.p.1 as $T \to \infty$. 
This is hardly surprising, since $\mu$ is the invariant distribution for the prelimit 
processes $(X_1^a, X_2^a)$. 

3. Escape from local minima. Finally it is worth commenting on the 
behavior of the diffusion coefficients as a function of the relative positions of 
$Y_1^\infty$ and $Y_2^\infty$ on an energy landscape. Recall that $\tau_1 < \tau_2$. Suppose that 
$Y_1^\infty(t)$ is near the bottom of a local minimum (which for simplicity we set to 
be zero), while $Y_2^\infty(t)$ is at a higher energy level, perhaps within the same 
local minimum. Then 

$$\rho(y_1,y_2) \approx \frac{e^{-V(y_2)/\tau_2}}{e^{-V(y_2)/\tau_2} + e^{-V(y_2)/\tau_1}} \approx 1, \qquad \rho(y_2,y_1) = 1 - \rho(y_1,y_2) \approx 0.$$

Thus to some degree the dynamics look like 

$$\begin{aligned} dY_1^\infty &= -\nabla V(Y_1^\infty)\,dt + \sqrt{2\tau_1}\,dW_1 \\ dY_2^\infty &= -\nabla V(Y_2^\infty)\,dt + \sqrt{2\tau_2}\,dW_2, \end{aligned}$$

i.e., the particle higher up on the energy landscape is given the greater 
diffusion coefficient, while the one near the bottom of the well is automatically 
given the lower coefficient. Hence the particle which is already closer to 
escaping from the well is automatically given the greater noise (within the confines 
of $(\tau_1, \tau_2)$). Recalling that the role of the higher temperature particle in parallel 
tempering is to more assiduously explore the landscape, this is an interesting 
property. 
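
To make the limit system concrete, here is a rough sketch in Python of an Euler-Maruyama discretization of the dynamics (2.6) in one dimension, together with the first marginal of the weighted empirical measure (2.7). This is only an illustration under stated assumptions: the time step, the starting point, and the function names are hypothetical, and the scheme is a direct discretization of the continuous time limit rather than the discrete time algorithm actually used in practice.

```python
import math
import random

def rho(V, x1, x2, tau1, tau2):
    """Weight rho(x1, x2) from (2.8), computed with a shift for numerical stability."""
    log_a = -V(x1) / tau1 - V(x2) / tau2
    log_b = -V(x2) / tau1 - V(x1) / tau2
    m = max(log_a, log_b)
    ea, eb = math.exp(log_a - m), math.exp(log_b - m)
    return ea / (ea + eb)

def infinite_swapping_average(V, grad_V, f, tau1, tau2, dt, n_steps, seed=0):
    """Weighted time average of f under the low temperature marginal.

    Each step adds rho * f(Y1) + (1 - rho) * f(Y2), the first marginal of the
    integrand in (2.7), then advances (Y1, Y2) by one Euler-Maruyama step of (2.6).
    """
    rng = random.Random(seed)
    y1, y2 = 0.0, 0.0              # arbitrary starting point (assumption)
    total = 0.0
    for _ in range(n_steps):
        r = rho(V, y1, y2, tau1, tau2)
        total += r * f(y1) + (1.0 - r) * f(y2)
        # State dependent noise levels: no swap is ever performed.
        s1 = math.sqrt(2.0 * dt * (tau1 * r + tau2 * (1.0 - r)))
        s2 = math.sqrt(2.0 * dt * (tau2 * r + tau1 * (1.0 - r)))
        y1, y2 = (y1 - grad_V(y1) * dt + s1 * rng.gauss(0.0, 1.0),
                  y2 - grad_V(y2) * dt + s2 * rng.gauss(0.0, 1.0))
    return total / n_steps
```

For the quadratic potential $V(x) = x^2/2$ the low temperature Gibbs measure is Gaussian with variance $\tau_1$, so calling the function with `f = lambda x: x * x` gives a simple sanity check: the returned value should be close to $\tau_1$ for a long enough run, up to discretization and sampling error.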

One can apply results from [2] to show that the empirical measure of the infinite 
swapping limit $\{\eta_T^\infty : T > 0\}$ satisfies a large deviation principle with rate function 
$I^\infty$ as defined in Remark 2.2. However, to justify the claim that the infinite swapping 
model is truly superior to the finite swapping variant (note that $I^\infty \ge I^a$ for any 
finite $a$), one should establish a uniform large deviation principle, which would show 
that $I^\infty$ is the correct rate function for any sequence $\{a_T : T > 0\} \subset [0, \infty]$ such that 
$a_T \to \infty$ as $T \to \infty$. We omit the proof here, since in Theorem 4.1 the analogous 
result will be proved in the setting of continuous time jump Markov processes. 

3. Diffusion models with multiple temperatures. In practice parallel tempering 
uses swaps between more than two temperatures. A key reason is that if the 
gap between the temperatures is too large then the probability of a successful swap 
under the [discrete time version of the] Metropolis rule (1.3) is far too low for the 
exchange of information to be effective. A natural generalization is to introduce, to 
the degree that computational feasibility is maintained, a ladder of higher temper- 
atures, and then attempt pairwise swaps between particles. There are a variety of 
schemes used to select the pair for which a swap is attempted, including deterministic and 
randomized rules for selecting only adjacent temperatures or arbitrary pairs of temperatures. 
However, if one were to replace any of these particle swapped processes 
with its equivalent temperature swapped analogue and consider the infinite swapping 
limit, one would get the same system of process dynamics and weighted empirical 
measures which we now describe. 

Suppose that besides the lowest temperature $\tau_1$ (in many cases the temperature 
of principal interest), we introduce the collection of higher temperatures 

$$\tau_1 < \tau_2 < \cdots < \tau_K.$$

Let $y = (y_1, y_2, \ldots, y_K) \in (\mathbb{R}^d)^K$ be a generic point in the state space of the process 
and define a product Gibbs distribution with the density 

$$\pi(y) = \pi(y_1, y_2, \ldots, y_K) \propto e^{-V(y_1)/\tau_1}\, e^{-V(y_2)/\tau_2} \cdots e^{-V(y_K)/\tau_K}.$$

The limit of the temperature swapped processes with $K$ temperatures takes the form 

$$\begin{aligned} dY_1^\infty &= -\nabla V(Y_1^\infty)\,dt + \sqrt{2\rho_{11}\tau_1 + 2\rho_{12}\tau_2 + \cdots + 2\rho_{1K}\tau_K}\,dW_1 \\ dY_2^\infty &= -\nabla V(Y_2^\infty)\,dt + \sqrt{2\rho_{21}\tau_1 + 2\rho_{22}\tau_2 + \cdots + 2\rho_{2K}\tau_K}\,dW_2 \\ &\ \ \vdots \\ dY_K^\infty &= -\nabla V(Y_K^\infty)\,dt + \sqrt{2\rho_{K1}\tau_1 + 2\rho_{K2}\tau_2 + \cdots + 2\rho_{KK}\tau_K}\,dW_K. \end{aligned}$$

To define these weights $\rho_{ij}$ it is convenient to introduce some new notation. 

Let $S_K$ be the collection of all bijective mappings from $\{1, 2, \ldots, K\}$ to itself. 
$S_K$ has $K!$ elements, each of which corresponds to a unique permutation of the set 
$\{1, 2, \ldots, K\}$, and $S_K$ forms a group with the group operation defined by composition. 
Let $\sigma^{-1}$ denote the inverse of $\sigma$. Furthermore, for each $\sigma \in S_K$ and every $y = 
(y_1, y_2, \ldots, y_K) \in (\mathbb{R}^d)^K$, define $y_\sigma = (y_{\sigma(1)}, y_{\sigma(2)}, \ldots, y_{\sigma(K)})$. 

At the level of the prelimit particle swapped process, we interpret the permutation 
$\sigma$ to correspond to the event that the particles at location $y = (y_1, y_2, \ldots, y_K)$ 
are swapped to the new location $y_\sigma = (y_{\sigma(1)}, y_{\sigma(2)}, \ldots, y_{\sigma(K)})$. Under the temperature 
swapped process, this corresponds to the event that particles initially assigned 
temperatures in the order $\tau_1, \tau_2, \ldots, \tau_K$ have now been assigned the temperatures 
$\tau_{\sigma^{-1}(1)}, \tau_{\sigma^{-1}(2)}, \ldots, \tau_{\sigma^{-1}(K)}$. 

The identification of the infinite swapping limit of the temperature swapped 
processes is very similar to that of the two temperature model in the previous section. 
By exploiting the time-scale separation, one can assume that in a small time interval 
the only motion is due to temperature swapping and the motion due to diffusion is 
negligible. Hence the fraction of time that the permutation $\sigma$ is in effect should again 
be proportional to the relative weight assigned by the invariant distribution to $y_\sigma$, 
that is, 

$$\pi(y_\sigma) = \pi(y_{\sigma(1)}, y_{\sigma(2)}, \ldots, y_{\sigma(K)}).$$

Thus if 

$$w(y) = \frac{\pi(y)}{\sum_{\theta \in S_K} \pi(y_\theta)},$$

then the fraction of time that the permutation $\sigma$ is in effect is $w(y_\sigma)$. Note that for 
any $y$, 

$$\sum_{\sigma \in S_K} w(y_\sigma) = 1.$$

Going back to the definition of the weights $\rho_{ij}(y)$, $i, j = 1, \ldots, K$, it is clear that they 
represent the limit proportion of time that the $i$-th particle is assigned the temperature 
$\tau_j$ and hence will satisfy 

$$\rho_{ij}(y) = \sum_{\sigma : \sigma(j) = i} w(y_\sigma).$$

Likewise the replacement for the empirical measure, accounting as it does for mapping 
the temperature swapped process back to the particle swapped process, is given by 

$$\eta_T^\infty = \frac{1}{T}\int_0^T \sum_{\sigma \in S_K} w(Y_\sigma^\infty(t))\,\delta_{Y_\sigma^\infty(t)}\,dt, \tag{3.1}$$

where $Y_\sigma^\infty(t) = [Y^\infty(t)]_\sigma = (Y_{\sigma(1)}^\infty(t), Y_{\sigma(2)}^\infty(t), \ldots, Y_{\sigma(K)}^\infty(t))$. 
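
The permutation weights and the resulting coefficients $\rho_{ij}$ can be computed by direct enumeration when $K$ is small. The following Python sketch is an illustration under stated assumptions (function names are hypothetical, indices are 0-based, and the input is the list of energies $V(y_1), \ldots, V(y_K)$ rather than the positions themselves). It enumerates all $K!$ permutations, so its cost grows factorially in $K$.

```python
import itertools
import math

def permutation_weights(V_values, taus):
    """Return all permutations sigma and the weights w(y_sigma).

    V_values[i] is the energy of particle i, and taus[j] is temperature j.
    pi(y_sigma) is proportional to exp(-sum_j V(y_sigma(j)) / tau_j).
    """
    K = len(taus)
    perms = list(itertools.permutations(range(K)))
    log_pi = [-sum(V_values[s[j]] / taus[j] for j in range(K)) for s in perms]
    m = max(log_pi)                           # shift for numerical stability
    unnormalized = [math.exp(lp - m) for lp in log_pi]
    total = sum(unnormalized)
    return perms, [u / total for u in unnormalized]

def rho_matrix(V_values, taus):
    """rho[i][j] = sum of w(y_sigma) over permutations with sigma(j) = i."""
    K = len(taus)
    perms, weights = permutation_weights(V_values, taus)
    rho = [[0.0] * K for _ in range(K)]
    for s, w in zip(perms, weights):
        for j in range(K):
            rho[s[j]][j] += w                 # particle s[j] holds temperature j
    return rho
```

Each row and each column of the resulting matrix sums to one, reflecting that every particle carries exactly one temperature and every temperature is carried by exactly one particle under each permutation. For $K = 2$ the entry `rho[0][0]` reduces to the two temperature weight $\rho(y_1, y_2)$.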

The instantaneous equilibration property still holds for the infinite swapping 
system with multiple temperatures. That is, at any time $t \in [0, T]$ and given a current 
position $Y^\infty(t) = y$, the weighted empirical measure $\eta_T^\infty$ has contributions from all 
locations of the form $y_\sigma$, $\sigma \in S_K$, balanced exactly according to their relative 
contributions from the invariant density $\pi(y_\sigma)$. The dynamics of $Y^\infty$ are again symmetric, 
and the density of the invariant distribution at point $y$ is 

$$\frac{1}{K!}\sum_{\sigma \in S_K} \pi(y_\sigma).$$

Remark 3.1. The infinite swapping process described above allows the most 
effective communication between all temperatures, and is the "best" in the sense 
that it leads to the largest large deviation rate function and hence the fastest rate of 
convergence. However, computation of the coefficients becomes very demanding for 
even moderate values of $K$, since one needs to evaluate $K!$ terms from all possible 
permutations. In Section 6 we discuss a more tractable and easily implementable 
family of schemes which are essentially approximations to the infinite swapping model 
presented in the current section and have very similar performance. We call the 
current model the full infinite swapping model since it uses the whole permutation 
group $S_K$, as opposed to the partial infinite swapping model in Section 6 where only 
subgroups of $S_K$ are used. 

4. Infinite swapping for jump Markov processes. The continuous time 
diffusion model is a convenient vehicle to convey the main idea of infinite swapping. 
In practice, however, algorithms are implemented in discrete time. In this section we 
discuss continuous time pure jump Markov processes and the associated infinite swapping 
limit. The purpose of this intermediate step is to serve as a bridge between the 
diffusion and discrete time Markov chain models. These two types of processes have 
some subtle differences regarding the infinite swapping limit, which are best illustrated 
through the continuous time jump Markov model. 

In this section we discuss the two-temperature model, and omit the completely
analogous multiple-temperature counterpart. We will not refer to temperatures $\tau_1$
and $\tau_2$ to distinguish dynamics. Instead, let $\alpha_1(x,dy)$ and $\alpha_2(x,dy)$ be two
probability transition kernels on $\mathbb{R}^d$ given $\mathbb{R}^d$. One can think of $\alpha_i$ as the
dynamics under temperature $\tau_i$ for $i = 1, 2$. We assume that for each $i = 1, 2$ the
stationary distribution $\mu_i$ associated with the transition kernel $\alpha_i$ admits the density
$\pi_i$ in order to be consistent with the diffusion models, and define

$$\mu = \mu_1 \times \mu_2, \qquad \pi(x_1, x_2) = \pi_1(x_1)\,\pi_2(x_2).$$

We assume that the kernels are Feller and have a density that is uniformly bounded 
with respect to Lebesgue measure. These conditions would always be satisfied in 
practice. Finally, we assume that the detailed balance or reversibility condition holds, 
that is, 

$$\alpha_i(x,dz)\,\pi_i(x)\,dx = \alpha_i(z,dx)\,\pi_i(z)\,dz. \tag{4.1}$$

4.1. Model setup. In the absence of swapping [i.e., swapping rate $a = 0$],
the dynamics of the system are as follows. Let $X^0 = \{X^0(t) = (X_1^0(t), X_2^0(t)) : t \ge 0\}$
denote a continuous time process taking values in $\mathbb{R}^d \times \mathbb{R}^d$. The probability
transition kernel associated with the embedded Markov chain, denoted by
$\bar X^0 = \{(\bar X_1^0(j), \bar X_2^0(j)) : j = 0, 1, \ldots\}$, is

$$P\{\bar X^0(j+1) \in (dy_1, dy_2) \mid \bar X^0(j) = (x_1, x_2)\} = \alpha_1(x_1, dy_1)\,\alpha_2(x_2, dy_2).$$

Without loss of generality, we assume that the jump times occur according to a
Poisson process with rate one. In other words, let $\{\tau_i\}$ be a sequence of independent
and identically distributed (iid) exponential random variables with rate one that are
independent of $\bar X^0$. Then

$$X^0(t) = \bar X^0(j) \quad \text{for } \sum_{i=1}^{j} \tau_i \le t < \sum_{i=1}^{j+1} \tau_i.$$

The infinitesimal generator of $X^0$ is such that for a given smooth function $f$,

$$\mathcal{L}^0 f(x_1, x_2) = \int_{\mathbb{R}^d \times \mathbb{R}^d} \left[f(y_1, y_2) - f(x_1, x_2)\right] \alpha_1(x_1, dy_1)\,\alpha_2(x_2, dy_2).$$

Owing to the detailed balance condition (4.1), the operator $\mathcal{L}^0$ is self-adjoint.
Using arguments similar to but simpler than those used to prove the uniform LDP
in Theorem 4.1, the large deviations rate function $I^0$ associated with the occupation
measure

$$\eta_T^0 = \frac{1}{T} \int_0^T \delta_{X^0(t)}\,dt$$

can be explicitly identified: for any probability measure $\nu$ on $\mathbb{R}^d \times \mathbb{R}^d$ with $\nu \ll \mu$
and $\theta = d\nu/d\mu$,

$$I^0(\nu) = 1 - \int \sqrt{\theta(x_1,x_2)\,\theta(y_1,y_2)}\;\pi(x_1,x_2)\,\alpha_1(x_1,dy_1)\,\alpha_2(x_2,dy_2)\,dx_1\,dx_2,$$

and $I^0$ is extended to all of $\mathcal{P}(\mathbb{R}^d \times \mathbb{R}^d)$ by lower semicontinuous regularization.
4.2. Finite swapping model. Denote by $X^a = \{(X_1^a(t), X_2^a(t)) : t \ge 0\}$ the
state process of the finite swapping model with swapping rate $a$, and let
$\bar X^a = \{(\bar X_1^a(j), \bar X_2^a(j)) : j = 0, 1, \ldots\}$ be the embedded Markov chain. The probability
transition kernel for $\bar X^a$ is

$$
\begin{aligned}
P\{\bar X^a(j+1) \in (dy_1, dy_2) \mid \bar X^a(j) = (x_1, x_2)\}
 &= \frac{1}{a+1}\,\alpha_1(x_1, dy_1)\,\alpha_2(x_2, dy_2) \\
 &\quad + \frac{a}{a+1}\left[g(x_1,x_2)\,\delta_{(x_2,x_1)}(dy_1,dy_2) + (1 - g(x_1,x_2))\,\delta_{(x_1,x_2)}(dy_1,dy_2)\right],
\end{aligned}
$$

where $g$ is defined as in (1.3). Furthermore, let $\{\tau_i^a\}$ be a sequence of iid exponential
random variables with rate $(a + 1)$ and define

$$X^a(t) = \bar X^a(j) \quad \text{for } \sum_{i=1}^{j} \tau_i^a \le t < \sum_{i=1}^{j+1} \tau_i^a.$$

In other words, the jumps occur according to a Poisson process with rate $a + 1$. Note
that there are two types of jumps. At any jump time, with probability $1/(a + 1)$ it
will be a jump according to the underlying probability transition kernels $\alpha_1$ and $\alpha_2$.
With probability $a/(a + 1)$ it will be an attempted swap which will succeed with the
probability determined by $g$. As $a$ grows, the swap attempts become more and more
frequent. However, the time between two consecutive jumps of the first type will have
the same distribution as

$$\sum_{i=1}^{N^a} \tau_i^a,$$

where $N^a$ is a geometric random variable with parameter $1/(a + 1)$. It is easy to argue
that for any $a$ the distribution of this sum is exponential with rate one. This observation
will be useful when we derive the infinite swapping limit.
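
The claim that the waiting time between two kernel-driven jumps is exponential with rate one for every swap rate can be checked by simulation. The following is a minimal sketch, not part of the paper: the function names and the sample size are illustrative assumptions. It draws the geometric sum of rate-$(a+1)$ exponentials directly and reports its sample mean and the empirical tail probability at 1, which for an Exp(1) variable should be close to 1 and $e^{-1}$ respectively.

```python
import random


def time_between_kernel_jumps(a, rng):
    """Sum of N iid Exp(a + 1) waiting times, N geometric with parameter 1/(a + 1)."""
    total = 0.0
    while True:
        total += rng.expovariate(a + 1.0)    # wait for the next jump of either type
        if rng.random() < 1.0 / (a + 1.0):   # this jump is driven by alpha_1, alpha_2
            return total
        # otherwise the jump was only a swap attempt; keep waiting


def sample_mean_and_tail(a, n=20000, seed=0):
    """Sample mean of the waiting time and the empirical value of P(S > 1)."""
    rng = random.Random(seed)
    draws = [time_between_kernel_jumps(a, rng) for _ in range(n)]
    mean = sum(draws) / n
    tail = sum(d > 1.0 for d in draws) / n
    return mean, tail


if __name__ == "__main__":
    for a in (0.0, 1.0, 10.0):
        print(a, sample_mean_and_tail(a))
```

The sample statistics should not depend on $a$ beyond Monte Carlo noise, which is the property used when passing to the limit $a \to \infty$.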

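The infinitesimal generator of $X^a$ is such that for any smooth function $f$ on
$\mathbb{R}^d \times \mathbb{R}^d$,

$$
\begin{aligned}
\mathcal{L}^a f(x_1,x_2) &= \int_{\mathbb{R}^d \times \mathbb{R}^d} \left[f(y_1,y_2) - f(x_1,x_2)\right]\alpha_1(x_1,dy_1)\,\alpha_2(x_2,dy_2) \\
&\quad + a\,g(x_1,x_2)\left[f(x_2,x_1) - f(x_1,x_2)\right].
\end{aligned}
$$

It is not difficult to check that the stationary distribution of $X^a$ remains $\mu$ and that
$\mathcal{L}^a$ is self-adjoint. As before, the large deviation rate function for the occupation
measure

$$\eta_T^a = \frac{1}{T}\int_0^T \delta_{X^a(t)}\,dt \tag{4.2}$$

can be explicitly identified. Indeed, for any probability measure $\nu$ on $\mathbb{R}^d \times \mathbb{R}^d$ with
$\nu \ll \mu$ and $\theta = d\nu/d\mu$,

$$I^a(\nu) = I^0(\nu) + a\,J(\nu),$$

where

$$J(\nu) = \frac{1}{2}\int g(x_1,x_2)\left[1 - \sqrt{\frac{\theta(x_2,x_1)}{\theta(x_1,x_2)}}\,\right]^2 \nu(dx_1,dx_2).$$

Note that as before, $I^a$ is monotonically increasing with respect to $a$. Since $J(\nu) \ge 0$
with equality if and only if $\theta(x_1,x_2) = \theta(x_2,x_1)$ $\nu$-a.e., we have

$$
I^\infty(\nu) = \lim_{a \to \infty} I^a(\nu) =
\begin{cases}
I^0(\nu) & \text{if } \theta(x_1,x_2) = \theta(x_2,x_1)\ \nu\text{-a.s.}, \\
\infty & \text{otherwise.}
\end{cases}
\tag{4.3}
$$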

4.3. Infinite swapping limit. The infinite swapping limit for $X^a$ as $a \to \infty$
can be similarly obtained by considering the corresponding temperature swapped
processes. Since the times between jumps determined by $\alpha_1$ and $\alpha_2$ are always
exponential with rate one, the infinite swapping limit $Y^\infty = (Y_1^\infty, Y_2^\infty)$ is a pure jump
Markov process where jumps occur according to a Poisson process with rate one. In
other words,

$$Y^\infty(t) = \bar Y^\infty(j) \quad \text{for } \sum_{i=1}^{j} \tau_i \le t < \sum_{i=1}^{j+1} \tau_i,$$

where $\bar Y^\infty$ is the embedded Markov chain and $\{\tau_i\}$ a sequence of iid exponential
random variables with rate one. Furthermore, the probability transition kernel for
$\bar Y^\infty$ is

$$
\begin{aligned}
&P\{\bar Y^\infty(j+1) \in (dz_1,dz_2) \mid \bar Y^\infty(j) = (y_1,y_2)\} \\
&\qquad = \rho(y_1,y_2)\,\alpha_1(y_1,dz_1)\,\alpha_2(y_2,dz_2) + \rho(y_2,y_1)\,\alpha_2(y_1,dz_1)\,\alpha_1(y_2,dz_2),
\end{aligned}
\tag{4.4}
$$

where the weight function $\rho$ is defined as in Theorem 2.3. It is not difficult to argue
that the stationary distribution for $Y^\infty$ is

$$\bar\mu(dy_1,dy_2) = \frac{1}{2}\left[\pi(y_1,y_2) + \pi(y_2,y_1)\right]dy_1\,dy_2,$$

and the weighted occupation measure

$$\eta_T^\infty = \frac{1}{T}\int_0^T \left[\rho(Y_1^\infty(t),Y_2^\infty(t))\,\delta_{(Y_1^\infty(t),Y_2^\infty(t))} + \rho(Y_2^\infty(t),Y_1^\infty(t))\,\delta_{(Y_2^\infty(t),Y_1^\infty(t))}\right]dt \tag{4.5}$$

converges to $\mu(dx_1,dx_2) = \pi(x_1,x_2)\,dx_1\,dx_2$ as $T \to \infty$. It is obvious that the dynamics
of the infinite swapping limit are symmetric and instantaneously equilibrate the
contribution from $(Y_1^\infty, Y_2^\infty)$ and $(Y_2^\infty, Y_1^\infty)$ according to the invariant measure, owing to
the weight function $\rho$.

We have the following uniform large deviation principle result, which justifies the
superiority of the infinite swapping model. Its proof is deferred to the appendix. It should
be noted that rate identification is not covered by the existing literature, even in the
case of a fixed swapping rate, due to the pure jump nature of the process.

Theorem 4.1. The occupation measure $\{\eta_T^\infty : T > 0\}$ satisfies a large deviation
principle with rate function $I^\infty$. More generally, define the finite swapping model as
in Subsection 4.2. Consider any sequence $\{a_T : T > 0\} \subset [0, \infty]$ such that $a_T \to \infty$
as $T \to \infty$, and interpret $a_T < \infty$ to mean that $\eta_T^{a_T}$ is defined by (4.2) with $a = a_T$,
and $a_T = \infty$ to mean that $\eta_T^{a_T}$ is defined by (4.5). Then $\{\eta_T^{a_T} : T > 0\}$ satisfies an
LDP with the rate function $I^\infty$ defined in equation (4.3).



5. Discrete time process models. 
5.1. Conventional parallel tempering algorithms. In the discrete time,
multi-temperature algorithms that are actually implemented, a swap is attempted
after a deterministic or random number of time steps, with a success probability of
the form (1.3). The two temperatures corresponding to particles for which a swap
is attempted can be chosen according to a deterministic or random schedule, and as
noted previously are usually adjacent since otherwise the success probability (1.3) will
be too small to allow efficient exchange of information.

As before it suffices to describe the algorithm in the setting of two temperatures.
As in Section 4, let $\alpha_i(x,dy)$ denote the probability transition kernel for temperature
$\tau_i$ whose stationary distribution has a density $\pi_i$ for $i = 1, 2$. For now let $N = 1/a$
be a fixed positive integer that determines the frequency of swap attempts. Let
$\bar X^a = \{(\bar X_1^a(j), \bar X_2^a(j)) : j = 0, 1, \ldots\}$ denote the state process. Then the evolution of
the dynamics is as follows. For any integer $k \ge 1$ and $(k-1)(N+1) \le j \le k(N+1) - 2$,

$$P\{\bar X^a(j+1) \in (dy_1,dy_2) \mid \bar X^a(j) = (x_1,x_2)\} = \alpha_1(x_1,dy_1)\,\alpha_2(x_2,dy_2),$$

and for $j = k(N+1) - 1$,

$$
\begin{aligned}
P\{\bar X^a(j+1) = (x_2,x_1) \mid \bar X^a(j) = (x_1,x_2)\} &= g(x_1,x_2), \\
P\{\bar X^a(j+1) = (x_1,x_2) \mid \bar X^a(j) = (x_1,x_2)\} &= 1 - g(x_1,x_2).
\end{aligned}
$$

Thus a swap is attempted after every $N$ ordinary time steps based on the underlying
transition kernels $\alpha_1$ and $\alpha_2$. The case $N = 1/a$ with $a$ an integer greater than one
corresponds to the case where multiple swaps are attempted between two ordinary
time steps. The unique invariant distribution of $\bar X^a$ is $\mu(dx_1\,dx_2) = \pi(x_1,x_2)\,dx_1\,dx_2$,
regardless of the value of $N$, and the occupation measure

$$\frac{1}{J}\sum_{j=0}^{J-1} \delta_{\bar X^a(j)}$$

converges to $\mu$ as $J \to \infty$ almost surely.
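
The index bookkeeping above can be made concrete with a short sketch. It is illustrative only: the function names are hypothetical, and because (1.3) is not reproduced in this section the swap probability is assumed to take the usual Metropolis form $g(x_1,x_2) = \min\{1, \pi(x_2,x_1)/\pi(x_1,x_2)\}$ with Gibbs densities $\pi_i \propto e^{-V/\tau_i}$.

```python
import math


def is_swap_step(j, N):
    """True when j = k(N + 1) - 1 for some integer k >= 1, i.e. a swap is attempted."""
    return (j + 1) % (N + 1) == 0


def swap_probability(x1, x2, tau1, tau2, V):
    """Assumed Metropolis form of g: min(1, pi(x2, x1) / pi(x1, x2)).

    With pi_i proportional to exp(-V / tau_i), the log of the ratio is
    (V(x1) - V(x2)) * (1 / tau1 - 1 / tau2).
    """
    log_ratio = (V(x1) - V(x2)) * (1.0 / tau1 - 1.0 / tau2)
    return math.exp(min(0.0, log_ratio))


if __name__ == "__main__":
    # With N = 3 a swap is attempted at j = 3, 7, 11, ...
    print([j for j in range(12) if is_swap_step(j, 3)])
    print(swap_probability(0.0, 1.0, 0.1, 1.0, lambda x: x * x))
```

Under this assumed form a swap is always accepted when it moves the lower-energy configuration to the lower temperature.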

Remark 5.1. Note that $N$ could be random. For example, if $N$ is chosen to be
a geometric random variable with mean A, then $\bar X^a$ is exactly the embedded Markov
chain of the pure jump Markov process $X^a$ with $a = 1/A$ in Subsection 4.2.

5.2. Infinite swapping model. As with the continuous time case, to produce
a well-defined limit one must consider the temperature swapped process and then
consider the limit as swapping frequency tends to infinity. It turns out that the
limit is exactly the embedded Markov chain for the pure jump Markov process in
Subsection 4.3. That is, the infinite swapping limit in discrete time is a Markov chain
$\bar Y^\infty = \{(\bar Y_1^\infty(j), \bar Y_2^\infty(j)) : j = 0, 1, \ldots\}$ with the transition kernel

$$\rho(y_1,y_2)\,\alpha_1(y_1,dz_1)\,\alpha_2(y_2,dz_2) + \rho(y_2,y_1)\,\alpha_2(y_1,dz_1)\,\alpha_1(y_2,dz_2). \tag{5.1}$$

The corresponding weighted empirical measure is

$$\eta_J^\infty = \frac{1}{J}\sum_{j=0}^{J-1}\left[\rho(\bar Y_1^\infty(j),\bar Y_2^\infty(j))\,\delta_{(\bar Y_1^\infty(j),\bar Y_2^\infty(j))} + \rho(\bar Y_2^\infty(j),\bar Y_1^\infty(j))\,\delta_{(\bar Y_2^\infty(j),\bar Y_1^\infty(j))}\right]. \tag{5.2}$$
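
To make (5.1) and (5.2) concrete, the following sketch simulates the two-temperature infinite swapping chain in one dimension. Everything specific here is an assumption made for illustration and does not come from the paper: the double-well potential `V`, the random-walk Metropolis kernels standing in for $\alpha_1$ and $\alpha_2$, the temperatures, and the step size. The weight $\rho$ is computed from unnormalized Gibbs densities, whose normalizing constants cancel in the ratio.

```python
import math
import random


def V(x):
    """Hypothetical double-well potential, used only for illustration."""
    return (x * x - 1.0) ** 2


def rho(y1, y2, tau1, tau2):
    """rho(y1, y2) = pi(y1, y2) / (pi(y1, y2) + pi(y2, y1)) for pi_i ~ exp(-V / tau_i)."""
    a = -V(y1) / tau1 - V(y2) / tau2   # log pi(y1, y2) up to a constant
    b = -V(y2) / tau1 - V(y1) / tau2   # log pi(y2, y1) up to the same constant
    m = max(a, b)
    return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))


def metropolis_step(x, tau, rng, step=0.5):
    """Random-walk Metropolis kernel; it satisfies detailed balance for pi_tau."""
    prop = x + rng.uniform(-step, step)
    if rng.random() < math.exp(min(0.0, (V(x) - V(prop)) / tau)):
        return prop
    return x


def infinite_swapping_step(y1, y2, tau1, tau2, rng):
    """One draw from the kernel (5.1)."""
    if rng.random() < rho(y1, y2, tau1, tau2):
        # first coordinate moves with alpha_1, second with alpha_2
        return metropolis_step(y1, tau1, rng), metropolis_step(y2, tau2, rng)
    # temperatures exchanged: first coordinate moves with alpha_2, second with alpha_1
    return metropolis_step(y1, tau2, rng), metropolis_step(y2, tau1, rng)


def weighted_average(f, n_steps, tau1=0.1, tau2=1.0, seed=0):
    """Integrate f against the first marginal of the weighted measure (5.2)."""
    rng = random.Random(seed)
    y1, y2 = 1.0, 1.0
    total = 0.0
    for _ in range(n_steps):
        r = rho(y1, y2, tau1, tau2)
        total += r * f(y1) + (1.0 - r) * f(y2)   # rho(y2, y1) = 1 - rho(y1, y2)
        y1, y2 = infinite_swapping_step(y1, y2, tau1, tau2, rng)
    return total / n_steps


if __name__ == "__main__":
    print(weighted_average(lambda x: x * x, 20000))
```

Note that no swap is ever performed: the exchange of information between temperatures enters only through `rho`, both in the choice of kernel and in the weights attached to the two orderings of the state.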

The generalization to multiple temperatures is also straightforward. Suppose that
there are $K$ temperatures. Denote the infinite swapping limit process by
$\bar Y^\infty = \{\bar Y^\infty(j) : j = 0, 1, \ldots\}$, which is a Markov chain taking values in the space $(\mathbb{R}^d)^K$.
Given that the current state of the chain is $\bar Y^\infty(j) = y = (y_1, \ldots, y_K)$, define as before the
weights

$$w(y) = \frac{\pi(y)}{\sum_{\theta \in S_K} \pi(y_\theta)}.$$

Then the transition kernel of $\bar Y^\infty$ is

$$P(\bar Y^\infty(j+1) \in dz \mid \bar Y^\infty(j) = y) = \sum_{\sigma \in S_K} w(y_\sigma)\,\alpha_1(y_{\sigma(1)},dz_{\sigma(1)}) \cdots \alpha_K(y_{\sigma(K)},dz_{\sigma(K)}).$$

The discrete time numerical approximation to the invariant distribution is

$$\frac{1}{J}\sum_{j=0}^{J-1}\sum_{\sigma \in S_K} w(\bar Y_\sigma^\infty(j))\,\delta_{\bar Y_\sigma^\infty(j)}.$$
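
The cost of the full scheme comes from the $K!$ weights that must be evaluated at every step. The sketch below enumerates them directly; it is an illustration under stated assumptions, namely that $\pi$ is a product of Gibbs densities $\pi_i \propto e^{-V/\tau_i}$ for a user-supplied potential `V`, and that permutations are represented as 0-based tuples (the paper's notation is 1-based). The function name is hypothetical.

```python
import itertools
import math


def permutation_weights(y, taus, V):
    """Return {sigma: w(y_sigma)} for every sigma in S_K.

    Normalizing constants of the pi_i cancel, so unnormalized log densities
    suffice; the maximum is subtracted before exponentiating for stability.
    """
    K = len(y)
    log_p = {}
    for sigma in itertools.permutations(range(K)):
        y_sigma = [y[s] for s in sigma]   # (y_{sigma(1)}, ..., y_{sigma(K)})
        log_p[sigma] = -sum(V(x) / tau for x, tau in zip(y_sigma, taus))
    m = max(log_p.values())
    z = sum(math.exp(lp - m) for lp in log_p.values())
    return {sigma: math.exp(lp - m) / z for sigma, lp in log_p.items()}


if __name__ == "__main__":
    w = permutation_weights((0.0, 1.0, 2.0), (0.5, 1.0, 2.0), lambda x: x * x)
    for sigma, weight in sorted(w.items()):
        print(sigma, round(weight, 4))
```

The dictionary has $K!$ entries, so this direct enumeration is only practical for small $K$.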

Remark 5.2. It is not difficult to derive large deviation principles for the discrete
time finite swapping or infinite swapping models. It remains an open question, however,
whether the rate function is monotonic with respect to the swap rate (frequency).
Nevertheless, the discrete time large deviation rate function can be obtained from that
of the continuous time pure jump Markov process models through the contraction
principle, and the two coincide in the limit as the transition kernels $\alpha_i$ correspond to
an infinitesimal time step for the diffusion process (1.1). Hence the discrete time rate
function will be at least approximately monotone, and in this sense the infinite swapping
limit should (at least approximately) dominate all finite swapping algorithms.
This is supported by the data presented in Section 7 and the much more extensive
empirical study presented in [14].

6. Partial infinite swapping. As noted in Section 3, the number of weights and
their calculation can become unwieldy for infinite swapping even when the number of
temperatures is moderate. In this section we construct algorithms that maintain most
of the benefit of the infinite swapping algorithm but at a much lower computational
cost. In the first subsection we describe the infinite swapping limit models when only a
subgroup of the permutations of the particles (respectively, temperatures) are allowed
by the prelimit particle (respectively, temperature) swapped process. The computational
complexity of these limit models will be controlled by limiting the number of
permutations that communicate with each other through the swapping mechanism.
The infinite swapping models in this subsection will be called partial infinite swapping
models, as opposed to the full infinite swapping models in the previous section. The
second subsection shows how such partial infinite swapping schemes can be interwoven
to approximate the full infinite swapping model.

6.1. Partial infinite swapping models. We consider subsets $A$ of $S_K$ with
the property that $A$ is an algebraic subgroup of $S_K$. That is,

1. the identity belongs to $A$;

2. if $\sigma_1, \sigma_2 \in A$ then $\sigma_1 \circ \sigma_2 \in A$, where $\circ$ denotes composition;

3. if $\sigma \in A$ then $\sigma^{-1} \in A$.

Although one can write down a partial infinite swapping model that corresponds
to instantaneous equilibration for an arbitrary subset $A$, it is only when $A$ is a subgroup
that the corresponding partial infinite swapping process has an interpretation
as the limit of parallel tempering type processes. When alternating between partial
infinite swapping processes, a "handoff" rule will be needed, and it is only for those
which correspond to subgroups that such a handoff rule is well defined. This point is
discussed in some detail in the next section.

The definition of the partial infinite swapping process based on $A$ is completely
analogous to that of the full infinite swapping process. The state process
$\{\bar Y(j) : j = 0, 1, \ldots\}$ is a Markov chain with the transition kernel

$$\alpha^A(y,dz) = \sum_{\sigma \in A} w^A(y_\sigma)\,\alpha_1(y_{\sigma(1)},dz_{\sigma(1)}) \cdots \alpha_K(y_{\sigma(K)},dz_{\sigma(K)}) \tag{6.1}$$

and the weighted empirical measure is

$$\eta_J = \frac{1}{J}\sum_{j=0}^{J-1}\sum_{\sigma \in A} w^A(\bar Y_\sigma(j))\,\delta_{\bar Y_\sigma(j)},$$

where the weight function $w^A$ is defined by

$$w^A(y) = \frac{\pi(y)}{\sum_{\theta \in A} \pi(y_\theta)} \tag{6.2}$$

and satisfies for any $y$

$$\sum_{\sigma \in A} w^A(y_\sigma) = 1.$$

We omit the dependence on both $a = \infty$ and $A$ from the notation. Note that in
contrast with the full swapping system, it is only those permutations of $y$ corresponding
to $\sigma \in A$ that are balanced according to the invariant distribution in their
contributions to $\eta_J$.

To illustrate the construction we present a few examples. With a standard abuse
of notation denote the permutation $\sigma$ such that $\sigma(i) = a_i$ by the form $(a_1, a_2, \ldots, a_K)$.
In particular, $(1, 2, \ldots, K)$ is the identity of the group $S_K$.

Example 6.1. Let $K = 4$ and $A = \{(1,2,3,4), (2,1,3,4)\}$. This corresponds to
only allowing swaps between temperatures $\tau_1$ and $\tau_2$ at the prelimit. Define

$$w(y) = \frac{\pi(y_1,y_2,y_3,y_4)}{\pi(y_1,y_2,y_3,y_4) + \pi(y_2,y_1,y_3,y_4)}.$$

The probability transition kernel of the corresponding partial infinite swapping process
is given by

$$
\begin{aligned}
& w(y_1,y_2,y_3,y_4)\,\alpha_1(y_1,dz_1)\,\alpha_2(y_2,dz_2)\,\alpha_3(y_3,dz_3)\,\alpha_4(y_4,dz_4) \\
&\quad + w(y_2,y_1,y_3,y_4)\,\alpha_1(y_2,dz_2)\,\alpha_2(y_1,dz_1)\,\alpha_3(y_3,dz_3)\,\alpha_4(y_4,dz_4)
\end{aligned}
$$

and the contribution to the weighted empirical measure is

$$w(\bar Y_1,\bar Y_2,\bar Y_3,\bar Y_4)\,\delta_{(\bar Y_1,\bar Y_2,\bar Y_3,\bar Y_4)} + w(\bar Y_2,\bar Y_1,\bar Y_3,\bar Y_4)\,\delta_{(\bar Y_2,\bar Y_1,\bar Y_3,\bar Y_4)}.$$

Note that with $\pi_{ij}$ denoting the marginal invariant distribution on the $i$-th and $j$-th
components, the weight function can be written as

$$w(y) = \frac{\pi_{12}(y_1,y_2)}{\pi_{12}(y_1,y_2) + \pi_{12}(y_2,y_1)},$$

which is consistent with the weights of the two-temperature model in Theorem 2.3.

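Example 6.2. Again take $K = 4$, but this time use the subgroup generated by
$(2,1,3,4)$ and $(1,2,4,3)$, i.e., $A = \{(1,2,3,4), (2,1,3,4), (1,2,4,3), (2,1,4,3)\}$. Then
the dynamics are given by

$$
\begin{aligned}
& w(y_1,y_2,y_3,y_4)\,\alpha_1(y_1,dz_1)\,\alpha_2(y_2,dz_2)\,\alpha_3(y_3,dz_3)\,\alpha_4(y_4,dz_4) \\
&\quad + w(y_2,y_1,y_3,y_4)\,\alpha_1(y_2,dz_2)\,\alpha_2(y_1,dz_1)\,\alpha_3(y_3,dz_3)\,\alpha_4(y_4,dz_4) \\
&\quad + w(y_1,y_2,y_4,y_3)\,\alpha_1(y_1,dz_1)\,\alpha_2(y_2,dz_2)\,\alpha_3(y_4,dz_4)\,\alpha_4(y_3,dz_3) \\
&\quad + w(y_2,y_1,y_4,y_3)\,\alpha_1(y_2,dz_2)\,\alpha_2(y_1,dz_1)\,\alpha_3(y_4,dz_4)\,\alpha_4(y_3,dz_3),
\end{aligned}
$$

where the weight function $w$ is defined by

$$w(y) = \frac{\pi(y_1,y_2,y_3,y_4)}{\pi(y_1,y_2,y_3,y_4) + \pi(y_2,y_1,y_3,y_4) + \pi(y_1,y_2,y_4,y_3) + \pi(y_2,y_1,y_4,y_3)}.$$

The contribution to the weighted empirical measure is

$$
\begin{aligned}
& w(\bar Y_1,\bar Y_2,\bar Y_3,\bar Y_4)\,\delta_{(\bar Y_1,\bar Y_2,\bar Y_3,\bar Y_4)} + w(\bar Y_2,\bar Y_1,\bar Y_3,\bar Y_4)\,\delta_{(\bar Y_2,\bar Y_1,\bar Y_3,\bar Y_4)} \\
&\quad + w(\bar Y_1,\bar Y_2,\bar Y_4,\bar Y_3)\,\delta_{(\bar Y_1,\bar Y_2,\bar Y_4,\bar Y_3)} + w(\bar Y_2,\bar Y_1,\bar Y_4,\bar Y_3)\,\delta_{(\bar Y_2,\bar Y_1,\bar Y_4,\bar Y_3)}.
\end{aligned}
$$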

Example 6.3. We let $K = 3$ and take $A$ to be the subgroup of $S_K$ generated
by the rotation $(2,3,1)$, i.e., $A = \{(1,2,3), (2,3,1), (3,1,2)\}$. Then the dynamics are
given by

$$
\begin{aligned}
& w(y_1,y_2,y_3)\,\alpha_1(y_1,dz_1)\,\alpha_2(y_2,dz_2)\,\alpha_3(y_3,dz_3) \\
&\quad + w(y_2,y_3,y_1)\,\alpha_1(y_2,dz_2)\,\alpha_2(y_3,dz_3)\,\alpha_3(y_1,dz_1) \\
&\quad + w(y_3,y_1,y_2)\,\alpha_1(y_3,dz_3)\,\alpha_2(y_1,dz_1)\,\alpha_3(y_2,dz_2),
\end{aligned}
$$

where

$$w(y) = \frac{\pi(y_1,y_2,y_3)}{\pi(y_1,y_2,y_3) + \pi(y_2,y_3,y_1) + \pi(y_3,y_1,y_2)},$$

and the contribution to the weighted empirical measure is

$$w(\bar Y_1,\bar Y_2,\bar Y_3)\,\delta_{(\bar Y_1,\bar Y_2,\bar Y_3)} + w(\bar Y_2,\bar Y_3,\bar Y_1)\,\delta_{(\bar Y_2,\bar Y_3,\bar Y_1)} + w(\bar Y_3,\bar Y_1,\bar Y_2)\,\delta_{(\bar Y_3,\bar Y_1,\bar Y_2)}.$$

The first two examples would correspond to the infinite swapping limit of a standard
parallel tempering process, where swaps between only 1 and 2 are allowed in
the first example, and swaps between 1 and 2 and swaps between 3 and 4 are allowed
in the second. Note that the computational complexity does not increase significantly
between the first and second example. The third example corresponds
to a very different sort of prelimit process, in which "rotations" of the coordinates
$(y_1,y_2,y_3) \to (y_2,y_3,y_1) \to (y_3,y_1,y_2) \to (y_1,y_2,y_3)$ are allowed. One can devise a
Metropolis type rule that allows such "swaps" and yields the indicated infinite swapping
system.

6.2. Approximating full infinite swapping by partial swapping. In this
section we consider the issue of alternating between such partial infinite swapping
systems to approximate the full infinite swapping limit. Let $A$ and $B$ be subgroups of
$S_K$. $A$ and $B$ are said to generate $S_K$ if the smallest subgroup that contains $A$ and $B$
is $S_K$ itself. Note that the total number of permutations in $A \cup B$ can be significantly
smaller than $K!$, the size of $S_K$. In fact, it is possible to construct subgroups $A$ and
$B$ that generate $S_K$ and such that the total number of permutations in $A \cup B$ is of order
$K$. There is an obvious extension to more than two subgroups.

Example 6.4. Let $K = 4$ and let $A$ be generated by $\{(2,1,3,4), (1,3,2,4)\}$ and
$B$ be generated by $\{(1,3,2,4), (1,2,4,3)\}$, respectively. Thus $A$ is the collection of
6 permutations that fix the last component and allow all rearrangements of the first
three, while $B$ fixes the first component and allows all rearrangements of the last three.
Then $A$ and $B$ generate $S_K$.

Example 6.5. Let $K = 4$ and let $A$ and $B$ be subgroups generated by $\{(2,1,3,4)\}$
and $\{(2,3,4,1)\}$, respectively. In other words, $A = \{(1,2,3,4), (2,1,3,4)\}$ corresponds
to only allowing permutations between the first two components, while
$B = \{(1,2,3,4), (2,3,4,1), (3,4,1,2), (4,1,2,3)\}$ corresponds to cycling of the four
temperatures. Then $A$ and $B$ generate $S_K$.
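
The claims in Examples 6.4 and 6.5 can be verified mechanically for small $K$. The sketch below, whose function names are illustrative, represents a permutation by the tuple $(a_1, \ldots, a_K)$ of 1-based images as in the text and builds the generated subgroup by closing the set of generators under composition; in a finite group this closure is automatically a subgroup.

```python
def compose(s, t):
    """(s o t)(i) = s(t(i)) for permutations given as tuples of 1-based images."""
    return tuple(s[t[i] - 1] for i in range(len(s)))


def generated_subgroup(generators):
    """Smallest subgroup of S_K containing the given permutations."""
    K = len(generators[0])
    identity = tuple(range(1, K + 1))
    group = {identity}
    frontier = [identity]
    while frontier:
        s = frontier.pop()
        for g in generators:
            p = compose(g, s)
            if p not in group:
                group.add(p)
                frontier.append(p)
    return group


if __name__ == "__main__":
    # Example 6.4
    A = generated_subgroup([(2, 1, 3, 4), (1, 3, 2, 4)])
    B = generated_subgroup([(1, 3, 2, 4), (1, 2, 4, 3)])
    print(len(A), len(B), len(generated_subgroup(list(A | B))))
    # Example 6.5
    A = generated_subgroup([(2, 1, 3, 4)])
    B = generated_subgroup([(2, 3, 4, 1)])
    print(len(A | B), len(generated_subgroup(list(A | B))))
```

In Example 6.5 the union $A \cup B$ contains only 5 permutations, yet the subgroup it generates is all of $S_4$.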

To keep the computational cost controlled, one can approximate the full infinite
swapping model by alternating between partial infinite swapping processes whose
associated subgroups generate the whole group. However, one must be careful in how
the "handoff" is made when switching between different partial swapping models. It
turns out that one cannot simply switch between different partial infinite swapping
dynamics (i.e., transition kernels). Recall that in order to get a consistent approximation
to the desired target invariant distribution we do not use the empirical measure
generated by $\bar Y$, but rather a carefully constructed weighted empirical measure that
works with several permutations of $\bar Y$. Simply switching the dynamics and weights
will in fact produce an algorithm that may not converge to the target distribution.

To see how one should design a handoff rule, note that if one considers a collection
of transition kernels each having the same invariant distribution and alternates
between them in a way that does not depend on the outcomes prior to a switch, then
the resulting empirical measure will in fact converge to the common invariant distribution.
This fact is used (at least implicitly) in the parallel tempering algorithm itself,
where one alternates the pair of particles being considered for swapping according to
deterministic or random rules so long as the random rules do not rely on previously
observed outcomes.

Now we use the fact that each partial infinite swapping model is a limit of either
a parallel tempering algorithm where only some pairs of particles are considered for
swapping, or some more general form of parallel tempering which would allow groups
of particles to simultaneously swap (according to an appropriate Metropolis-type
acceptance rule). An example in the former category would be $A$ in Example 6.4, which
arises if only the pairs corresponding to temperatures $\tau_1, \tau_2$ and $\tau_2, \tau_3$ are allowed to
swap, whereas an example in the latter category would be $B$ in Example 6.5, which
corresponds to allowing the particle at temperature $\tau_i$ to move to the location of the
particle at temperature $\tau_{i-1}$ (with $\tau_0 = \tau_4$), and the reverse. Furthermore, each of
these partial infinite swapping models arises as a limit of transition kernels of the
corresponding temperature swapped processes which preserve the same common invariant
distribution. In taking the limit as the swap rate tends to infinity, the correspondence
between particle locations for a particle swapped process and the "instantaneously
equilibrated" temperature swapped process $\bar Y$ is lost. However, one can construct a
consistent algorithm by reconstructing this correspondence. In fact one should choose
the particle location according to the probabilities (under the invariant distribution)
associated with the various permutations in the subgroup. See Subsection 6.3 for
more detailed discussion on the intuition behind this approximation algorithm.

We next present an algorithm for alternating between two partial infinite swapping
dynamics. The restriction to two is for notational convenience only. Suppose that the
dynamics are indexed by the corresponding subgroups $A$ and $B$, and that $n_A$ steps of
subgroup $A$ are to be alternated with $n_B$ steps of subgroup $B$. For simplicity we do
not describe a "burn-in" period. As in (6.1) and (6.2) we let $\alpha^A(y,dz)$ and $w^A(y)$
denote the transition kernel for $A$ and the weights allocated to the permutation $\sigma \in A$,
respectively, and similarly for $B$.

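Algorithm 6.6. (Approximation to full infinite swapping)

1. Initialization: $\bar X^B(0) = \bar Y(0;0) \in (\mathbb{R}^d)^K$, $\ell = 1$.

2. Loop $\ell$:

(a) Initialization for A dynamics: set $\bar Y(\ell;0) = \bar X^B(\ell - 1)$.

(b) Subgroup A dynamics: update $\bar Y(\ell;k)$, $k = 1, \ldots, n_A$, according to the
transition kernel $\alpha^A$, and add

$$\sum_{k=1}^{n_A}\sum_{\sigma \in A} w^A(\bar Y_\sigma(\ell;k))\,\delta_{\bar Y_\sigma(\ell;k)}$$

to the un-normalized empirical measure.

(c) Reconstructing particle locations at the end of A dynamics: Let $\bar X^A(\ell)$
be a random sample from the set $\{\bar Y_\sigma(\ell;n_A) : \sigma \in A\}$ according to the
weights $\{w^A(\bar Y_\sigma(\ell;n_A)) : \sigma \in A\}$.

(d) Initialization for B dynamics: set $\bar Y(\ell;n_A) = \bar X^A(\ell)$.

(e) Subgroup B dynamics: update $\bar Y(\ell;k)$, $k = n_A + 1, \ldots, n_A + n_B$, according
to the transition kernel $\alpha^B$, and add

$$\sum_{k=n_A+1}^{n_A+n_B}\sum_{\sigma \in B} w^B(\bar Y_\sigma(\ell;k))\,\delta_{\bar Y_\sigma(\ell;k)}$$

to the un-normalized empirical measure.

(f) Reconstructing particle locations at the end of B dynamics: Let $\bar X^B(\ell)$
be a random sample from the set $\{\bar Y_\sigma(\ell;n_A + n_B) : \sigma \in B\}$ according to
the weights $\{w^B(\bar Y_\sigma(\ell;n_A + n_B)) : \sigma \in B\}$.

(g) Set $\ell = \ell + 1$ and loop back to (a).

3. Normalize the empirical measure.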
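
The steps of Algorithm 6.6 can be sketched in code as follows. This is a toy illustration under stated assumptions rather than the implementation used for the numerical results: the one-dimensional double-well potential `V`, the random-walk Metropolis kernels standing in for the $\alpha_i$, the starting point, and all function names are hypothetical. Permutations are 1-based tuples as in the text, and the estimate returned is the integral of a test function against the lowest-temperature marginal of the normalized weighted empirical measure.

```python
import math
import random


def V(x):
    """Hypothetical one-dimensional double-well potential (illustration only)."""
    return (x * x - 1.0) ** 2


def permute(y, sigma):
    """y_sigma = (y_{sigma(1)}, ..., y_{sigma(K)}) for a 1-based tuple sigma."""
    return tuple(y[s - 1] for s in sigma)


def weights(y, group, taus):
    """w(y_sigma) for sigma in group, for pi proportional to exp(-sum_i V(y_i)/tau_i)."""
    logs = [-sum(V(x) / t for x, t in zip(permute(y, s), taus)) for s in group]
    m = max(logs)
    e = [math.exp(v - m) for v in logs]
    z = sum(e)
    return [v / z for v in e]


def metropolis(x, tau, rng, step=0.5):
    """Random-walk Metropolis kernel standing in for alpha_i."""
    prop = x + rng.uniform(-step, step)
    if rng.random() < math.exp(min(0.0, (V(x) - V(prop)) / tau)):
        return prop
    return x


def partial_step(y, group, taus, rng):
    """One draw from the kernel (6.1): choose sigma with probability w(y_sigma),
    then move coordinate sigma(i) with the kernel of temperature tau_i."""
    sigma = rng.choices(group, weights=weights(y, group, taus))[0]
    z = list(y)
    for i, s in enumerate(sigma):
        z[s - 1] = metropolis(y[s - 1], taus[i], rng)
    return tuple(z)


def handoff(y, group, taus, rng):
    """Steps (c) and (f): sample particle locations y_sigma with probability w(y_sigma)."""
    sigma = rng.choices(group, weights=weights(y, group, taus))[0]
    return permute(y, sigma)


def alternate(group_a, group_b, taus, n_a, n_b, n_loops, f, seed=0):
    """Alternate n_a steps of A dynamics with n_b steps of B dynamics."""
    rng = random.Random(seed)
    x = tuple(1.0 for _ in taus)                    # step 1
    total, count = 0.0, 0
    for _loop in range(n_loops):                    # step 2
        for group, n in ((group_a, n_a), (group_b, n_b)):
            y = x                                   # steps (a) and (d)
            for _k in range(n):                     # steps (b) and (e)
                y = partial_step(y, group, taus, rng)
                w = weights(y, group, taus)
                total += sum(wi * f(permute(y, s)[0]) for wi, s in zip(w, group))
                count += 1
            x = handoff(y, group, taus, rng)        # steps (c) and (f)
    return total / count                            # step 3


if __name__ == "__main__":
    A = [(1, 2, 3), (2, 1, 3)]
    B = [(1, 2, 3), (1, 3, 2)]
    print(alternate(A, B, (0.1, 0.3, 1.0), 5, 5, 2000, lambda v: v * v))
```

The only place where the two dynamics interact is `handoff`, which resamples the particle locations from the current subgroup's weights before the other subgroup takes over.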

6.3. Discussions on the approximation. In this section we further discuss 
the intuition underlying the handoff rule between different partial infinite swapping 
dynamics and the approximation algorithm of the previous section. We temporarily 
assume that the model is in continuous time since the intuition is most transparent 
in this case. 

For simplicity let us assume that there are three temperatures and two groups
$A = \{(1,2,3), (2,1,3)\}$ and $B = \{(1,2,3), (1,3,2)\}$. That is, under group $A$ dynamics
only pairwise swaps between the temperatures $\tau_1$ and $\tau_2$ are allowed, while under the
group $B$ dynamics, only the swaps between $\tau_2$ and $\tau_3$ are allowed. See Figure 6.1.

Consider the following prelimit swapping model. Let the swap rate be $a$. The
dynamics corresponding to group $A$ and group $B$ will be alternated on time intervals
of length $h$. Hence on the interval $[2kh, (2k+1)h)$ the particle swapped process
$(X_1^a, X_2^a, X_3^a)$ only involves swaps between temperatures $\tau_1$ and $\tau_2$. One can easily
construct the corresponding temperature swapped process $(Y_1^a, Y_2^a, Y_3^a)$ as before.
Note that $X_3^a = Y_3^a$ on this time interval. Similarly, on the interval $[(2k+1)h, (2k+2)h)$,
only swaps between $\tau_2$ and $\tau_3$ are allowed and on this interval $X_1^a = Y_1^a$. Note
that there is no ambiguity for the prelimit processes at the switch times $t = h, 2h, \ldots$,
since the locations of the particles are known.

Fig. 6.1. Approximation via partial infinite swapping (time axis divided into successive intervals labeled Dynamics A, Dynamics B, Dynamics A).

Now consider the limit as $a \to \infty$ with $h$ being fixed. Without loss of generality,
we will only discuss how to deal with the switch of the dynamics at time $t = h$.
On the time interval $[0, h)$ we have the partial infinite swapping limit process
$\bar Y^A = (\bar Y_1^A, \bar Y_2^A, \bar Y_3^A)$ that corresponds to the group $A$. Similarly it is clear that on the time
interval $[h, 2h)$ we should have the partial infinite swapping process $\bar Y^B$ corresponding
to the group $B$. The problem, however, is that by taking the limit we lose the information
on the locations of the particles $(X_1^a, X_2^a, X_3^a)$. Unless we can somehow recover this
information at the switch time $t = h$ to assign $\bar Y^B(h)$, we cannot determine the
dynamics of $\bar Y^B$ on $[h, 2h)$. The key is to recall that the infinite swapping limit
instantaneously equilibrates multiple locations according to the invariant distribution.
In other words, given $\bar Y^A(h-) = y = (y_1, y_2, y_3)$, the locations of the particles are
distributed according to

$$\frac{\pi(y_1,y_2,y_3)\,\delta_{(y_1,y_2,y_3)} + \pi(y_2,y_1,y_3)\,\delta_{(y_2,y_1,y_3)}}{\pi(y_1,y_2,y_3) + \pi(y_2,y_1,y_3)}.$$

Therefore, in order to identify the locations of the particles at time $h$, we will take
a random sample from this distribution once $\bar Y^A(h-)$ is known. This explains the
handoff rule used at the switch times of partial infinite swapping processes.

Now we let $h \to 0$. Since $A$ and $B$ generate the whole permutation group $S_K$,
it is easy to check that at each time instant, the locations of $\{y_\sigma : \sigma \in S_K\}$ are
equilibrated according to their invariant distribution, and therefore in the limit we
will attain the full infinite swapping model. This can be made rigorous by exploiting
the time scale separation between the slow diffusion processes $(\bar Y^A, \bar Y^B)$ and the fast
switching process. We omit the proof because the discussion is largely motivational.

Coming back to the discrete time partial infinite swapping model, it is clear that
Algorithm 6.6 is nothing but a straightforward adaptation of the preceding discussion
to discrete time. The only difference is that one cannot establish an analogous result
regarding approximation to the full infinite swapping model as in continuous time.
The subtlety here is that in continuous time, as $h \to 0$, one can basically ignore any
effect from the diffusion on any small time interval and assume that the process is only
making jumps between different permutations of a fixed triple $(y_1, y_2, y_3)$. This time
scale separation is no longer valid in discrete time. In this setting, the performance
of a scheme based on interweaving partial infinite swapping schemes lies between
parallel tempering and full infinite swapping, and computational results suggest that
it is closer to the latter than the former.

The issue of which interwoven partial schemes will perform best is an open question.
In practice we have used schemes of the following form. Suppose that a set
of say 45 temperatures is given. We then partition 45 into blocks of sizes 3, 6, . . . , 6,
with the first block containing the lowest three temperatures, the second block the
next six, and so on. Dynamic A then is given by allowing all permutations within
each block. Note that the complexity of the coefficients is then no worse than 6!.
In Dynamic B we use the partition 6, 6, . . . , 6, 3. The form of the partial scheme is
heuristically motivated by allowing the largest possible overlap between the different
blocks when switching between dynamics, subject to the constraint that blocks be of
size no greater than 6.
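
The two block structures just described are easy to write down explicitly. The sketch below is illustrative, with hypothetical helper names and 0-based temperature indices; it builds both partitions of the 45 temperatures and checks the overlap property that motivates them, namely that every pair of adjacent temperatures lies in a common block under at least one of the two dynamics.

```python
import math


def blocks(sizes):
    """Split indices 0, ..., sum(sizes) - 1 into consecutive blocks of the given sizes."""
    out, start = [], 0
    for s in sizes:
        out.append(list(range(start, start + s)))
        start += s
    return out


def share_block(partition, i, j):
    """True if temperatures i and j belong to the same block of the partition."""
    return any(i in b and j in b for b in partition)


K = 45
dynamic_a = blocks([3] + [6] * 7)   # block sizes 3, 6, ..., 6
dynamic_b = blocks([6] * 7 + [3])   # block sizes 6, ..., 6, 3

# Largest number of permutations whose weights must be evaluated within one block.
max_terms = max(math.factorial(len(b)) for b in dynamic_a + dynamic_b)

if __name__ == "__main__":
    print(max_terms)
    print(all(share_block(dynamic_a, i, i + 1) or share_block(dynamic_b, i, i + 1)
              for i in range(K - 1)))
```

Because the weights factor over blocks, the per-step cost is governed by the largest block rather than by $45!$.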

7. Numerical examples. In this section we present data comparing parallel
tempering at various swap rates and both full and partial infinite swapping. We
present what we call "relaxation studies." The quantity of interest is the average
potential energy of the lowest temperature component under the invariant distribution.
In these studies, the system is run a long time to reach equilibrium, after which it is
repeatedly pushed out of equilibrium and we measure the time needed to "relax" back
to equilibrium. Each cycle consists of temporarily raising the temperatures of some
of the lowest temperature components for a number of steps sufficient to push the
average potential energy away from the "true" value (as measured by either sample or
time averages). The temperatures are then returned to their "true" values for a fixed
number of steps, and the process is then repeated 2,000 times. We plot the average of
the 2,000 samples as a function of the number of moves, and the performance of the
algorithm is captured by the rate at which these averages approach the correct value.




Figures 3 and 4 present data for a Lennard-Jones cluster of 13 atoms, using
the "smart Monte Carlo" scheme of [18] for the simulation of the dynamics, which
produces a relatively large move in configuration space for each step. The "true" value
is approximately -42.92. This is a relatively simple model, and was studied using only
4 temperatures. The temperatures are dropped to the true values at step 50. Infinite
swapping converges more rapidly than any of the parallel tempering schemes. We see
in Figure 3 that the most efficient of the parallel tempering schemes appears to use an
attempted swap rate of around 64%. [The rates that would typically be used in such
calculations are in the range of 5-10%.] Figure 4 magnifies a portion of the graph,
but plots only the best parallel tempering result and adds a partial infinite swapping
result based on blocks of the form 1,3 and 3,1, and with a handoff at each Metropolis
step. Little difference is observed between the partial and full forms, though exclusive
use of either of the partial forms by itself performs poorly.
Figure 5

The Lennard-Jones cluster of 13 atoms is not a particularly demanding problem,
but is presented so a comparison can be made between the full and partial infinite
swapping forms. A much more complex example is the Lennard-Jones cluster of 38
atoms. Data for this example obtained using a 45 temperature ensemble is given in
Figure 5. Because full infinite swapping is impossible for this larger computational
ensemble, we use the partial form. For comparison, results are also presented for
parallel tempering.

The details concerning the computational methods underlying both the parallel
tempering and infinite swapping results of Figure 5, along with a discussion of the
temperature ensemble involved for this example, can be found in [14]. Briefly summarized,
as with the previous 13-atom Lennard-Jones example, Figure 5 presents the results
of a series of relaxation experiments. Here, however, 45 temperatures are used with
the lowest 15 being involved in the heating/cooling process. The heating and cooling
cycles consist of 1200 smart Monte Carlo moves, each of one unit Lennard-Jones time
duration. The cooling segment is taken as the portion of the cycle from moves 200
to 800 with the remainder being the heating portion. During the cooling portion of
the cycle the 45 temperatures in the ensemble cover the range from 0.050 to 0.210 in
temperature steps of 0.005, and from 0.210 to 0.330 in steps of 0.010, while during
the heating portion of the cycle temperatures less than or equal to 0.150 are set equal
to 0.150. The results shown in Figure 5 are obtained using 600 thermal cycles.

The 38-atom Lennard-Jones cluster has an interesting landscape. In particular,
while the global and lowest-lying local minima are similar in energy, the minimum
energy pathway that separates them involves appreciably higher energies and contains
13 separate barriers fM^, Chapter 8.3]. As discussed in [14], the partial infinite swapping
approach is appreciably more effective than conventional tempering approaches
in providing a proper sampling of this complex potential energy landscape.

8. Appendix. 

8.1. Proof of Theorem 4.1. Throughout the proof, we let $S = S_1 \times S_1$, where 
$S_1 \subset \mathbb{R}^d$ is convex and compact. Let $\mathcal{P}(S)$ denote the Polish space of all probability 
measures on $S$ equipped with the topology of weak convergence. For any probability 
measure $\nu \in \mathcal{P}(S)$, define its mirror image $\nu^R \in \mathcal{P}(S)$ by requiring 

\[
\nu^R(A \times B) = \nu(B \times A)
\]

for all Borel sets $A, B \subset \mathbb{R}^d$. Furthermore, as in the rest of the paper, a bold symbol 
$\mathbf{x} \in S$ means $\mathbf{x} = (x_1, x_2)$, where $x_1, x_2 \in \mathbb{R}^d$, and $\mathbf{x}^R \doteq (x_2, x_1)$. We also use the 
notation 

\[
\alpha(\mathbf{x}, d\mathbf{y}) \doteq \alpha_1(x_1, dy_1)\,\alpha_2(x_2, dy_2),
\]

which is a probability transition kernel defined on $S$ given $S$. 
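
For a measure with finitely many atoms the mirror image can be computed directly. The following is a small sketch under that assumption; the dictionary representation and the helper names `mirror` and `measure_of_rectangle` are hypothetical and serve only to illustrate the defining identity for the mirror image.

```python
def mirror(nu):
    """Mirror image of a discrete measure given as {(x1, x2): mass}."""
    out = {}
    for (x1, x2), mass in nu.items():
        out[(x2, x1)] = out.get((x2, x1), 0.0) + mass
    return out


def measure_of_rectangle(nu, A, B):
    """Mass that nu assigns to the rectangle A x B, for finite sets A and B."""
    return sum(mass for (x1, x2), mass in nu.items() if x1 in A and x2 in B)
```

Applying `mirror` twice returns the original measure, as expected from the definition.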

To prove the uniform large deviation principle, it suffices to prove the equivalent 
uniform Laplace principle [3, Chapter 1]. To simplify the proof we have assumed that 
$S$ is compact. This would be the case if, e.g., $V$ is defined with periodic boundary 
conditions. The general case can be handled under (2.1) by using $V$ as a Lyapunov 
function [3, Section 8.2]. It will be convenient to split the proof into upper and lower 
bounds. We also consider just the (more complicated) case where $a_T \to \infty$ but 
$a_T < \infty$ for each $T$. Allowing $a_T = \infty$ requires a different notation to handle this 
special case, but does not change the structure of the proof otherwise. 

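We will show for any bounded continuous function $F : \mathcal{P}(S) \to \mathbb{R}$ that 

\[
\lim_{T \to \infty} -\frac{1}{T} \log E\left[\exp\{-T F(\eta_T^a)\}\right] = \inf_{\nu \in \mathcal{P}(S)} \left[F(\nu) + I^\infty(\nu)\right]. \tag{8.1}
\]

By adding a constant to both sides we can assume $F \ge 0$, and do so for the rest of 
this section. 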

The proof of the uniform large deviation principle is based on the weak convergence 
approach. The proof is complicated by the multiscale aspect of the fast 
swapping process, as well as the fact that $\eta_T^a$ is a weighted empirical measure that 
involves this fast process. 

8.2. Preliminary results. 

8.2.1. A representation. We first state a stochastic control representation for 
the left hand side of (8.1). As with the derivation of the infinite swapping process 
via weak convergence, it will be necessary to work with the (distributionally equivalent) 
temperature swapped processes for tightness to hold. In the representation, 
all random variables used to construct $\eta_T^a$ are replaced by random variables whose 
distribution is selected, and both the distributions and the random variables will be 
distinguished from their uncontrolled, original counterparts by an overbar. For this 
reason, while the continuous time process is denoted by $Y^a(t)$, we change notation 
and use $U^a(j)$ rather than $\bar{Y}^a(j)$ to denote the discrete time process. 

We first construct the temperature swapped process. Let $\alpha(x, dy|0) = \alpha(x, dy)$ 
and $\alpha(x, dy|1) \doteq \alpha(x^R, dy^R)$, and let $\{N_j^a, j = 0, 1, \ldots\}$ be iid geometric random 
variables with parameter $1/(1+a)$, i.e., geometric random variables with mean $a$. Then 
the random variables $\{U^a(j), j = 0, 1, \ldots\}$, $\{M_\ell^a(j), j = 0, 1, \ldots, \ell = 0, 1, \ldots, N_j^a\}$ are 
constructed recursively as follows. Given $M_0^a(j) = z$ and $U^a(j) = x$, $U^a(j+1)$ is 
distributed according to $\alpha(x, dy|z)$. The process $M_\ell^a(j), \ell = 0, 1, \ldots$, is a Markov 
chain with states $\{0, 1\}$ and transition probabilities 

\[
\begin{array}{ll}
p(0,0|x) \doteq g(x), & p(0,1|x) \doteq 1 - g(x), \\
p(1,0|x) = 1 - g(x^R), & p(1,1|x) \doteq g(x^R).
\end{array}
\tag{8.2}
\]

The initial value for the subsequent interval is given by $M_0^a(j+1) = M^a_{N_j^a}(j)$. Letting 
$\{\tau^a_{j,\ell}, j = 0, 1, \ldots, \ell = 0, 1, \ldots, N_j^a - 1\}$ be iid exponential random variables with mean 
$1/a$, the temperature swapped process in continuous time is then given by 

\[
Y^a(t) = U^a(j) \quad \text{for } \sum_{i=0}^{j-1} \sum_{k=0}^{N_i^a - 1} \tau^a_{i,k} \le t < \sum_{i=0}^{j} \sum_{k=0}^{N_i^a - 1} \tau^a_{i,k}
\]

(with the convention that the sum from $0$ to $-1$ is $0$), 

\[
Z^a(t) = M_\ell^a(j) \quad \text{for } \sum_{i=0}^{j-1} \sum_{k=0}^{N_i^a - 1} \tau^a_{i,k} + \sum_{k=0}^{\ell-1} \tau^a_{j,k} \le t < \sum_{i=0}^{j-1} \sum_{k=0}^{N_i^a - 1} \tau^a_{i,k} + \sum_{k=0}^{\ell} \tau^a_{j,k},
\]

and lastly the ordinary and weighted empirical measures are given by 

\[
\psi_T^a = \frac{1}{T} \int_0^T \delta_{Y^a(t)}\,dt \quad \text{and} \quad \eta_T^a = \frac{1}{T} \int_0^T \left[ 1_{\{Z^a(t)=0\}} \delta_{Y^a(t)} + 1_{\{Z^a(t)=1\}} \delta_{Y^a(t)^R} \right] dt.
\]
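
The recursive construction above can be turned into a short simulation. The following is a minimal sketch, not the authors' implementation: it assumes a user-supplied sampler `alpha(x, rng)` for the kernel $\alpha(x, dy)$ and a function `g(x)` giving the probability that the $\{0,1\}$-valued chain of (8.2) stays in its current state (evaluated at $x$ from state 0 and at $x^R$ from state 1, which is one reading of (8.2)). All function and parameter names are hypothetical.

```python
import math
import random


def reverse(x):
    """Mirror image x^R of a pair x = (x1, x2)."""
    return (x[1], x[0])


def simulate_weighted_measure(alpha, g, x0, a, T, seed=0):
    """Return the weighted empirical measure over [0, T] as a dict {state: mass}."""
    rng = random.Random(seed)
    q = a / (1.0 + a)                      # failure probability of the geometric law
    eta = {}
    x, z, t = x0, 0, 0.0
    while t < T:
        # U(j+1) ~ alpha(x, dy | z), where alpha(x, dy | 1) = alpha(x^R, dy^R)
        x_next = alpha(x, rng) if z == 0 else reverse(alpha(reverse(x), rng))
        # N_j is geometric on {0, 1, ...} with parameter 1/(1+a), hence mean a
        n = int(math.log(1.0 - rng.random()) / math.log(q))
        for _ in range(n):
            tau = min(rng.expovariate(a), T - t)   # exponential with mean 1/a, cut at T
            site = x if z == 0 else reverse(x)     # Z = 0 charges Y, Z = 1 charges Y^R
            eta[site] = eta.get(site, 0.0) + tau / T
            t += tau
            if t >= T:
                break
            # one step of the {0, 1} chain, conditioned on the current state x
            stay = g(x) if z == 0 else g(reverse(x))
            if rng.random() >= stay:
                z = 1 - z
        x = x_next
    return eta
```

The returned masses sum to one up to rounding, since every holding time is charged to exactly one of $Y^a(t)$ or $Y^a(t)^R$. For large `a` the inner loop dominates the cost, which is the computational motivation for replacing the fast swapping by weights.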

Let $\sigma^a$ denote the exponential distribution with mean $1/a$ and let $\beta^a$ denote the 
geometric distribution with mean $a$. For the representation, all distributions [e.g., 
$\alpha(x, dy|z)$] can be perturbed from their original form, but such a perturbation pays 
a relative entropy cost. We distinguish the new distributions and random variables 
by using an overbar. Given $T \in (0, \infty)$, let $R^a$ and $K^a$ be the discrete time indices 
when the continuous time parameter reaches $T$, i.e., 

\[
\sum_{i=0}^{R^a-2} \sum_{k=0}^{N_i^a-1} \tau^a_{i,k} + \sum_{k=0}^{K^a-1} \tau^a_{R^a-1,k} \le T < \sum_{i=0}^{R^a-2} \sum_{k=0}^{N_i^a-1} \tau^a_{i,k} + \sum_{k=0}^{K^a} \tau^a_{R^a-1,k}. \tag{8.3}
\]

In this representation the barred quantities are constructed analogously to their 
unbarred counterparts. Thus, e.g., $\bar{R}^a$ and $\bar{K}^a$ are defined by (8.3) but with $\tau^a_{i,k}$ 
replaced by $\bar{\tau}^a_{i,k}$. Random variables corresponding to any given value of $j$ are 
constructed in the order $\bar{U}^a(j+1)$, $\bar{N}_j^a$, $\bar{M}_\ell^a(j)$, $\bar{\tau}^a_{j,\ell}$, $\ell = 0, 1, \ldots, \bar{N}_j^a$, and then $j$ is updated 
to $j + 1$. Barred measures, which are also allowed to depend on discrete time, are used 
to construct the corresponding barred random variables, e.g., $\bar{U}^a(j+1)$ is (conditionally) 
distributed according to $\bar{\alpha}_j(\bar{U}^a(j), \cdot\,|\bar{M}_0^a(j))$. The infimum is over all collections 
of measures $\{\bar{\alpha}_j, \bar{\beta}_j^a, \bar{p}_{j,\ell}, \bar{\sigma}^a_{j,\ell}\}$ and, although this is not denoted explicitly, any particular 
measure can depend on all previously constructed random variables. To simplify 
notation we let $\bar{N}^a_{\bar{R}^a-1}$ denote $\bar{K}^a$. We state the representation for $\{\eta_T^a\}$, and note that 
an analogous representation holds for $\{\psi_T^a\}$. 

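Lemma 8.1. Let $G : \mathcal{P}(S) \times \mathbb{N} \to \mathbb{R}$ be bounded from below and measurable. Then 
the representation 

\[
\begin{aligned}
-\frac{1}{T} \log E\left[\exp\{-T G(\eta_T^a, R^a)\}\right] = \inf E \Bigg[ & G(\bar{\eta}_T^a, \bar{R}^a) \\
& + \frac{1}{T} \sum_{i=0}^{\bar{R}^a-1} \Big[ R\big(\bar{\alpha}_i(\bar{U}^a(i), \cdot\,|\bar{M}_0^a(i)) \,\big\|\, \alpha(\bar{U}^a(i), \cdot\,|\bar{M}_0^a(i))\big) + R\big(\bar{\beta}_i^a \,\big\|\, \beta^a\big) \Big] \\
& + \frac{1}{T} \sum_{i=0}^{\bar{R}^a-1} \sum_{k=0}^{\bar{N}_i^a-1} \Big[ R\big(\bar{p}_{i,k}(\bar{M}_k^a(i), \cdot\,|\bar{U}^a(i)) \,\big\|\, p(\bar{M}_k^a(i), \cdot\,|\bar{U}^a(i))\big) + R\big(\bar{\sigma}^a_{i,k} \,\big\|\, \sigma^a\big) \Big] \Bigg]
\end{aligned}
\]

is valid. 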

The proof of such representations follow from the chain rule for relative entropy 
(see, e.g., [U Section B.2]). A novel feature of the representation here is that the total 
number of discrete time steps is random. However, this case can easily be reduced to 
the case with a fixed deterministic number of steps. 

8.2.2. Rate for the ordinary empirical measure. 

Notation for marginals. We will frequently factor measures on product spaces in the proof. 
For a (deterministic) probability measure $\nu$ on a product space such as $S^1 \times S^2 \times S^3$, with each $S^i$ a 
Polish space, we use notation such as $[\nu]_{1,2}$ to denote the marginal distribution on the 
first 2 components, and notation such as $[\nu]_{1|3}$ to denote the conditional distribution 
on the first component given the third. When $\nu$ is a measurable random measure 
these can all be chosen so that they are also measurable. 

We will make use of the rate function for the ordinary empirical measure. Let 
$\varphi(x, dy) \doteq \alpha(x, dy|0)\rho_0(x) + \alpha(x, dy|1)\rho_1(x)$, where $\rho_0(x) \doteq \rho(x)$ and $\rho_1(x) = \rho(x^R)$, 
and let $\bar{\mu}(dx) \doteq [\pi(x) + \pi(x^R)]dx/2$ be its unique invariant probability distribution. 
If $\gamma$ is absolutely continuous with respect to $\bar{\mu}$ with $\kappa(x) = [d\gamma/d\bar{\mu}](x)$, then 
set 

\[
K(\gamma) \doteq 1 - \int \sqrt{\kappa(x)\kappa(y)}\,\bar{\mu}(dx)\varphi(x, dy).
\]

Note that $K$ is convex. We then extend the definition to all of $\mathcal{P}(S)$ via lower semicontinuous 
regularization with respect to the weak topology. Thus if $\gamma_i \to \gamma$ in 
the weak topology and if each $\gamma_i$ is absolutely continuous with respect to $\bar{\mu}$, then 
$\liminf_{i \to \infty} K(\gamma_i) \ge K(\gamma)$, and we have equality for at least one such sequence. Note 
that since $\bar{\mu}$ is mutually absolutely continuous with respect to Lebesgue measure, this 
means that $K(\gamma) \le 1$ for all $\gamma \in \mathcal{P}(S)$. 

The following lemma will help relate weak limits of quantities in the representations 
to the rate function of the ordinary empirical measure. 

Lemma 8.2. Let $\gamma \in \mathcal{P}(S)$ be absolutely continuous with respect to $\bar{\mu}$ and let $\kappa = 
[d\gamma/d\bar{\mu}]$. Assume $A \in (0, \infty)$, $\nu \in \mathcal{P}(S \times S)$ is such that $[\nu]_1 = [\nu]_2$, $R(\nu \| [\nu]_1 \otimes \varphi) < \infty$, 
$r$ is such that $r[\nu]_1 = \kappa\bar{\mu}$, and $0 \le -\int \log r(y)[\nu]_1(dy) < \infty$. Then 

\[
\begin{aligned}
K(\gamma) &= 1 - \int \sqrt{\kappa(x)\kappa(y)}\,\bar{\mu}(dx)\varphi(x, dy) \\
&\le A R(\nu \| [\nu]_1 \otimes \varphi) - A \int \log r(y)[\nu]_1(dy) + A \log A - A + 1.
\end{aligned}
\tag{8.4}
\]

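Proof. Let $C$ be the set where $\kappa(x) = 0$. Then $-\int \log r(y)[\nu]_1(dy) < \infty$ implies 
$r(y) > 0$ a.s. with respect to $[\nu]_1(dy)$, and we also have $\int r(y)[\nu]_1(dy) = 1$, so that 
$r(y) < \infty$ a.s. with respect to $[\nu]_1(dy)$. It follows that $[\nu]_1(C) = 0$. Now suppose 
that 

\[
\int \sqrt{\kappa(x)\kappa(y)}\,\bar{\mu}(dx)\varphi(x, dy) = 0.
\]

Then $\kappa(x)\kappa(y) = 0$ a.s. with respect to Lebesgue measure, and so if $\kappa(x) \ne 0$ 
then $\varphi(x, \{y : \kappa(y) > 0\}) = 0$, or $\varphi(x, C) = 1$. Thus $\nu((S \setminus C) \times C) = 0$, while 
$[\nu]_1 \otimes \varphi((S \setminus C) \times C) = 1$, which implies $R(\nu \| [\nu]_1 \otimes \varphi) = \infty$, which is a contradiction. 
We conclude that 

\[
\int \sqrt{\kappa(x)\kappa(y)}\,\bar{\mu}(dx)\varphi(x, dy) > 0.
\]

Since also 

\[
\int \sqrt{\kappa(x)\kappa(y)}\,\bar{\mu}(dx)\varphi(x, dy) \le 1,
\]

it follows that 

\[
-\log \int \sqrt{\kappa(x)\kappa(y)}\,\bar{\mu}(dx)\varphi(x, dy) \in [0, \infty).
\]

Since $R(\nu \| [\nu]_1 \otimes \varphi) < \infty$ and since $\bar{\mu}$ is the invariant distribution of $\varphi$ it follows 
that $R([\nu]_1 \| \bar{\mu}) < \infty$ [3, Lemma 8.6.2]. This means that 

\[
\int \log \frac{\kappa(x)}{r(x)}\,[\nu]_1(dx) < \infty,
\]

and since $-\int \log r(y)[\nu]_1(dy) < \infty$, it follows that $-\int \log \kappa(x)[\nu]_1(dx) > -\infty$. From 
$[\nu]_1 = [\nu]_2$ we conclude that 

\[
-\frac{1}{2} \int [\log \kappa(x) + \log \kappa(y)]\,\nu(dx, dy) > -\infty. \tag{8.5}
\]

By relative entropy duality ([3, Proposition 1.4.2]) 

\[
\begin{aligned}
-\log \int \sqrt{\kappa(x)\kappa(y)}\,\bar{\mu}(dx)\varphi(x, dy)
&= -\log \int e^{\frac{1}{2}[\log \kappa(x) + \log \kappa(y)]}\,\bar{\mu}(dx)\varphi(x, dy) \\
&\le R(\nu \| \bar{\mu} \otimes \varphi) - \frac{1}{2} \int [\log \kappa(x) + \log \kappa(y)]\,\nu(dx, dy)
\end{aligned}
\]

is valid as long as the right hand side is not of the form $\infty - \infty$, which is true by 
(8.5). The chain rule then gives 

\[
\begin{aligned}
-\log \int e^{\frac{1}{2}[\log \kappa(x) + \log \kappa(y)]}\,\bar{\mu}(dx)\varphi(x, dy)
&\le R(\nu \| \bar{\mu} \otimes \varphi) - \int \log \kappa(x)[\nu]_1(dx) \\
&= R(\nu \| [\nu]_1 \otimes \varphi) + R([\nu]_1 \| \bar{\mu}) - \int \log \kappa(x)[\nu]_1(dx) \\
&= R(\nu \| [\nu]_1 \otimes \varphi) - \int \log r(x)[\nu]_1(dx),
\end{aligned}
\]

and thus 

\[
-\int \sqrt{\kappa(x)\kappa(y)}\,\bar{\mu}(dx)\varphi(x, dy) \le -e^{-R(\nu \| [\nu]_1 \otimes \varphi) + \int \log r(x)[\nu]_1(dx)}.
\]

Then (8.4) follows from the fact that if $a \in \mathbb{R}$ and $b \in (0, \infty)$ then $-e^{-a} \le ab + b \log b - b$ 
by taking $a = R(\nu \| [\nu]_1 \otimes \varphi) - \int \log r(x)[\nu]_1(dx)$ and $b = A$. □ 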
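
The elementary inequality invoked in the last step can be checked directly. The following short derivation is one way to see it, and it also records the equality case, which is relevant whenever $A$ is chosen so that equality holds in (8.4). The auxiliary function $\phi$ is introduced here only for this check.

```latex
For fixed $b \in (0,\infty)$ set $\phi(a) \doteq ab + b\log b - b + e^{-a}$, $a \in \mathbb{R}$. Then
\[
\phi'(a) = b - e^{-a}, \qquad \phi''(a) = e^{-a} > 0,
\]
so $\phi$ is strictly convex with unique minimizer $a^* = -\log b$, and
\[
\phi(a^*) = -b\log b + b\log b - b + b = 0 .
\]
Hence $-e^{-a} \le ab + b\log b - b$ for all $a \in \mathbb{R}$,
with equality if and only if $b = e^{-a}$.
```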

8.2.3. Decomposition of the exponential distribution. In the construction 
of $\eta_T^a$ we used independent exponential random variables $\tau^a_{i,k}$ and geometric random 
variables $N_i^a$, and the fact that for each $i$ 

\[
\sum_{k=0}^{N_i^a-1} \tau^a_{i,k}
\]

is exponential with mean one. This decomposition corresponds to a relationship in 
relative entropies, which we now state. 

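Lemma 8.3. For $a \in (0, \infty)$, let $N^a$ be distributed according to a random probability 
measure $\bar{\beta}^a$ on $\{0, 1, \ldots\}$. Given $N^a = \ell$, for $k \in \{0, 1, \ldots, \ell\}$ let $\bar{\tau}_k(\ell)$ be distributed 
according to a random probability measure $\bar{\sigma}_k^a(\ell)$ on $[0, \infty)$. Let $\bar{\sigma}$ be the distribution 
of the random variable 

\[
\sum_{k=0}^{N^a-1} \bar{\tau}_k(N^a).
\]

Then 

\[
E\left[ R\big(\bar{\beta}^a \,\big\|\, \beta^a\big) + \sum_{k=0}^{N^a-1} R\big(\bar{\sigma}_k^a(N^a) \,\big\|\, \sigma^a\big) \right] \ge E\left[ R\big(\bar{\sigma} \,\big\|\, \sigma^1\big) \right].
\]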



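Proof. Define a measure $\mu$ on the space $\mathbb{N}_+ \times \prod_{k=0}^{\infty} [0, \infty)$ as follows. For any 
$\ell \in \mathbb{N}_+$ and any sequence $A_0, A_1, \ldots$ of Borel measurable subsets of $[0, \infty)$, 

\[
\mu(\{\ell\} \times A_0 \times A_1 \times \cdots) = \beta^a(\ell) \prod_{k=0}^{\infty} \sigma^a(A_k),
\]

and similarly define the measure $\bar{\mu}$ by 

\[
\bar{\mu}(\{\ell\} \times A_0 \times A_1 \times \cdots) = \bar{\beta}^a(\ell) \prod_{k=0}^{\ell-1} \bar{\sigma}_k^a(\ell)(A_k) \prod_{k=\ell}^{\infty} \sigma^a(A_k).
\]

Then by the chain rule of relative entropy 

\[
E[R(\bar{\mu} \| \mu)] = E\left[ R\big(\bar{\beta}^a \,\big\|\, \beta^a\big) + \sum_{\ell=0}^{\infty} \bar{\beta}^a(\ell) \sum_{k=0}^{\ell-1} R\big(\bar{\sigma}_k^a(\ell) \,\big\|\, \sigma^a\big) \right].
\]

Since 

\[
E\left[ \sum_{\ell=0}^{\infty} \bar{\beta}^a(\ell) \sum_{k=0}^{\ell-1} R\big(\bar{\sigma}_k^a(\ell) \,\big\|\, \sigma^a\big) \right] = E\left[ \sum_{k=0}^{N^a-1} R\big(\bar{\sigma}_k^a(N^a) \,\big\|\, \sigma^a\big) \right],
\]

it follows that 

\[
E[R(\bar{\mu} \| \mu)] = E\left[ R\big(\bar{\beta}^a \,\big\|\, \beta^a\big) + \sum_{k=0}^{N^a-1} R\big(\bar{\sigma}_k^a(N^a) \,\big\|\, \sigma^a\big) \right].
\]

Observe that the random variable $\sum_{k=0}^{N^a-1} \bar{\tau}_k(N^a)$ can be written as a measurable mapping on $\mathbb{N}_+ \times \prod_{k=0}^{\infty} [0, \infty)$, and 
that $\bar{\sigma}$ and $\sigma^1$ are the distributions induced on $[0, \infty)$ under that map by $\bar{\mu}$ and $\mu$, 
respectively. Since relative entropy can only decrease under such a mapping, it follows 
that $E R(\bar{\mu} \| \mu) \ge E R(\bar{\sigma} \| \sigma^1)$. □ 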

8.3. Lower bound. The proof of the lower bound will be partitioned into three 
cases according to a parameter $C \in (1, \infty)$. After the three cases have been argued, 
the proof of the lower bound will be completed. The first two cases are very simple, 
and give estimates when $R^a/T$ is small (i.e., unusually few exponential clocks 
with mean 1 are needed before time $T$ is reached) or when $R^a/T$ is large (i.e., an 
unusually large number of such clocks are needed to reach time $T$). The processes 
$\{U^a(j)\}$ and $\{M_\ell^a(j)\}$ play no role in these estimates, and the required estimates 
follow from Chebyshev's inequality as it is used in the proof of Cramér's 
Theorem. We will need the function $h(b) = -\log b + b - 1$, $b \in [0, \infty)$, which satisfies 
$\inf\{R(\gamma \| \sigma^1) : \int u\,\gamma(du) = b\} = h(b)$. 
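
The value $h(b)$ can be checked numerically. The sketch below evaluates the relative entropy of the exponential law with mean $b$ with respect to the exponential law with mean one by simple quadrature and compares it with $h(b)$. It checks only the value at this particular law, which is the exponentially tilted law that attains the infimum; it does not verify the infimum itself. The truncation point and grid size are arbitrary choices of this sketch.

```python
import math


def h(b):
    """h(b) = -log b + b - 1, as defined in the text (b > 0)."""
    return -math.log(b) + b - 1.0


def kl_exponential(b, n=200_000):
    """Relative entropy R(Exp(mean b) || Exp(mean 1)) by midpoint quadrature."""
    upper = 60.0 * max(b, 1.0)          # truncation point; the neglected tail is negligible
    du = upper / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * du
        q = math.exp(-u / b) / b        # density of the exponential law with mean b
        log_ratio = -math.log(b) - u / b + u   # log of q(u) / exp(-u)
        total += q * log_ratio * du
    return total
```

For example, `kl_exponential(2.0)` and `h(2.0)` agree to several decimal places, and `h(1.0)` is zero, reflecting that no cost is paid when the mean is left at one.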

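8.3.1. The case $R^a/T \le 1/C$. Let $F : \mathcal{P}(S) \to \mathbb{R}$ be non-negative and continuous. 
Then 

\[
\begin{aligned}
& -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a) - \infty 1_{[0,1/C]^c}(R^a/T)\} \right] \\
& \qquad = -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a)\}\, 1_{[0,1/C]}(R^a/T) \right].
\end{aligned}
\]

By Chebyshev's inequality, for $\alpha \in (0, 1)$ and with $\tau_i^a$ denoting the $i$th exponential clock with mean 1, 

\[
P\left\{ e^{\alpha \sum_{i=0}^{\lfloor T/C \rfloor + 1} \tau_i^a} \ge e^{\alpha T} \right\} \le \exp\left\{ (\lfloor T/C \rfloor + 2) \left( \log \frac{1}{1-\alpha} - \alpha C \right) \right\}.
\]

Optimizing this inequality over $\alpha \in (0, 1)$ gives 

\[
\begin{aligned}
\liminf_{T \to \infty} -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a) - \infty 1_{[0,1/C]^c}(R^a/T)\} \right]
&\ge \liminf_{T \to \infty} \frac{1}{T} (\lfloor T/C \rfloor + 2)\, h(C) \\
&= \frac{h(C)}{C}.
\end{aligned}
\]

Note that $h(C)/C \to 1$ as $C \to \infty$. 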
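
The optimization over $\alpha$ can be made concrete. The sketch below minimizes the per-clock exponent $\log\frac{1}{1-\alpha} - \alpha C$ over a grid in $(0,1)$ and compares the result with $-h(C)$, which is the value at the stationary point $\alpha = 1 - 1/C$. The grid search is an illustrative device and is not part of the proof.

```python
import math


def h(b):
    """h(b) = -log b + b - 1."""
    return -math.log(b) + b - 1.0


def chernoff_exponent(alpha, C):
    """Per-clock exponent log(1/(1 - alpha)) - alpha * C from the Chebyshev bound."""
    return math.log(1.0 / (1.0 - alpha)) - alpha * C


def optimized_exponent(C, grid=100_000):
    """Minimize the exponent over a uniform grid of alpha values in (0, 1)."""
    return min(chernoff_exponent(k / grid, C) for k in range(1, grid))
```

The minimum equals $-h(C)$, so the bound decays like $\exp\{-(\lfloor T/C \rfloor + 2) h(C)\}$. Evaluating `h(C) / C` for large `C` shows the slow approach of $h(C)/C$ to 1.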

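8.3.2. The case $R^a/T \ge C$. With $F$ as in the last section, an analogous argument 
gives 

\[
\begin{aligned}
\liminf_{T \to \infty} -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a) - \infty 1_{[C,\infty)^c}(R^a/T)\} \right]
&\ge \liminf_{T \to \infty} \frac{1}{T} (\lfloor T(C-1) \rfloor + 2)\, h\!\left( \frac{1}{C-1} \right) \\
&= (C-1)\, h\!\left( \frac{1}{C-1} \right).
\end{aligned}
\]

Note that $(C-1)\, h(1/(C-1)) \to \infty$ as $C \to \infty$. 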

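8.3.3. The case $1/C \le R^a/T \le C$. To analyze this case it will be sufficient to 
consider any deterministic sequence $\{r^a\}$ such that $1/C \le r^a/T \le C$, and such that 
$r^a/T \to A$ as $T \to \infty$, and obtain lower bounds on 

\[
\liminf_{T \to \infty} -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a) - \infty 1_{\{r^a/T\}^c}(R^a/T)\} \right].
\]

Since $I^\infty$ is convex, to prove the lower bound for (8.1) we can assume $F$ is convex 
and lower semicontinuous [3, Theorem 1.2.1]. The representation from Lemma 8.1 
will be applied. We first note that the representation will include a term of the form 
$\infty 1_{\{r^a/T\}^c}(\bar{R}^a/T)$ on the right hand side. We can remove this term if we restrict the 
infimum to controls for which $\bar{R}^a = r^a$ w.p.1, and do so for notational convenience. 
The representation thus becomes 

\[
\begin{aligned}
& -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a) - \infty 1_{\{r^a\}^c}(R^a)\} \right] \\
& \quad = \inf E \Bigg[ F(\bar{\eta}_T^a) + \frac{1}{T} \sum_{i=0}^{r^a-1} \Big[ R\big(\bar{\alpha}_i(\bar{U}^a(i), \cdot\,|\bar{M}_0^a(i)) \,\big\|\, \alpha(\bar{U}^a(i), \cdot\,|\bar{M}_0^a(i))\big) + R\big(\bar{\beta}_i^a \,\big\|\, \beta^a\big) \Big] \\
& \qquad\qquad + \frac{1}{T} \sum_{i=0}^{r^a-1} \sum_{k=0}^{\bar{N}_i^a-1} \Big[ R\big(\bar{p}_{i,k}(\bar{M}_k^a(i), \cdot\,|\bar{U}^a(i)) \,\big\|\, p(\bar{M}_k^a(i), \cdot\,|\bar{U}^a(i))\big) + R\big(\bar{\sigma}^a_{i,k} \,\big\|\, \sigma^a\big) \Big] \Bigg].
\end{aligned}
\tag{8.6}
\]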



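There are four relative entropy sums in (8.6). Since $F$ is bounded from below, the 
lower bound holds vacuously unless each such term is uniformly bounded. 

We first show that the empirical distribution on $\bar{N}_i^a$ converges to $\delta_\infty$ by using a 
martingale argument. For $C_1, C_2 \subset [0, \infty)$, let 

\[
\zeta^a(C_1 \times C_2) \doteq \frac{1}{r^a} \sum_{i=0}^{r^a-1} \delta_{\bar{N}_i^a}(C_1)\, \bar{\beta}_i^a(C_2).
\]

Consider the one-point compactification of $[0, \infty)$, which we identify with $[0, \infty]$, and 
let $f : [0, \infty] \to \mathbb{R}$ be bounded and continuous. Then for any $\varepsilon > 0$ 

\[
\begin{aligned}
P\left\{ \left| \frac{1}{r^a} \sum_{i=0}^{r^a-1} \left( f(\bar{N}_i^a) - \int f(n)\, \bar{\beta}_i^a(dn) \right) \right| \ge \varepsilon \right\}
&\le \frac{1}{\varepsilon^2} E\left[ \left( \frac{1}{r^a} \sum_{i=0}^{r^a-1} \left( f(\bar{N}_i^a) - \int f(n)\, \bar{\beta}_i^a(dn) \right) \right)^2 \right] \\
&\le \frac{4 \|f\|_\infty^2}{r^a \varepsilon^2},
\end{aligned}
\tag{8.7}
\]

and thus any weak limit of $\zeta^a$ has identical marginals. Next note that by Jensen's 
inequality and since $r^a/T \to A \in (0, \infty)$, the uniform bound on the second relative 
entropy sum implies that $E R([\zeta^a]_2 \| \beta^a)$ is uniformly bounded. Using that 
$\inf\{R(\gamma \| \beta^a) : \int u\,\gamma(du) = b\} = b \log \frac{b}{a} + (1+b) \log \frac{1+a}{1+b}$, it follows that the weak 
limit of $[\zeta^a]_2$ is $\delta_\infty$ w.p.1. 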
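
The closed form quoted for the geometric case can be checked by direct summation. The sketch below assumes the geometric law on $\{0, 1, \ldots\}$ with mean $m$ has mass function $(1/(1+m))(m/(1+m))^n$, and compares the series for the relative entropy between two such laws with the closed form. As with $h$, this verifies the value at the geometric law with mean $b$, which attains the infimum, not the infimum itself. Function names are illustrative.

```python
import math


def log_geometric_pmf(n, mean):
    """Log mass at n of the geometric law on {0, 1, ...} with the given mean."""
    p = 1.0 / (1.0 + mean)
    return math.log(p) + n * math.log1p(-p)


def kl_geometric_series(b, a, terms=20_000):
    """R(Geom(mean b) || Geom(mean a)) by direct summation of the series."""
    total = 0.0
    for n in range(terms):
        lq = log_geometric_pmf(n, b)
        total += math.exp(lq) * (lq - log_geometric_pmf(n, a))
    return total


def kl_geometric_closed_form(b, a):
    """b log(b/a) + (1 + b) log((1 + a)/(1 + b))."""
    return b * math.log(b / a) + (1.0 + b) * math.log((1.0 + a) / (1.0 + b))
```

For fixed $b$ the closed form grows like $\log a$ as $a \to \infty$, which is why a uniform bound on the relative entropy forces the mean of the controlled geometric law to diverge along with $a$.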

Next we consider the asymptotic properties of the collection $\{\bar{M}_k^a(i)\}$, under 
boundedness of the third relative entropy sum. We use that the empirical measure of 
the $\{\bar{N}_i^a\}$ tends to $\delta_\infty$. Since $p(\cdot, \cdot\,|\bar{U}^a(i))$ is the transition function of a finite state 
Markov chain this means that asymptotically the $\bar{M}_k^a(i)$, $k = 0, \ldots, \bar{N}_i^a$ are samples 
from the transition probability (8.2) with $x = \bar{U}^a(i)$, and in particular that $\bar{M}_0^a(i)$ 
is asymptotically conditionally independent with distribution $(\rho(x), 1 - \rho(x))$. Indeed, 
a martingale argument similar to (8.7) shows that if 

\[
\mu^a(C_1 \times C_2) \doteq \frac{1}{r^a} \sum_{i=0}^{r^a-1} \delta_{\bar{U}^a(i)}(C_1)\, \delta_{\bar{M}_0^a(i)}(C_2),
\]

and if $\mu^\infty$ is the weak limit of any convergent subsequence (which must exist by 
compactness), then 

\[
\big( [\mu^\infty]_{2|1}(\{0\}|y),\ [\mu^\infty]_{2|1}(\{1\}|y) \big) = (\rho(y),\ 1 - \rho(y)) \quad [\mu^\infty]_1\text{-a.s.}, \tag{8.8}
\]

and the same is true for the empirical measure of $\{\bar{M}_k^a(i), k = 0, \ldots, \bar{N}_i^a\}$. 

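We next remove the third relative entropy sum from the representation (8.6), and 
obtain a lower bound for the right hand side using Lemma 8.3. Let $\bar{\sigma}_i^a$ denote the 
distribution of the random variable 

\[
\bar{\tau}_i^a \doteq \sum_{k=0}^{\bar{N}_i^a-1} \bar{\tau}^a_{i,k}.
\]

Then we have the lower bound 

\[
\begin{aligned}
& -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a) - \infty 1_{\{r^a\}^c}(R^a)\} \right] \\
& \quad \ge \inf E \Bigg[ F(\bar{\eta}_T^a) + \frac{1}{T} \sum_{i=0}^{r^a-1} R\big(\bar{\sigma}_i^a \,\big\|\, \sigma^1\big) + \frac{1}{T} \sum_{i=0}^{r^a-1} R\big(\bar{\alpha}_i(\bar{U}^a(i), \cdot\,|\bar{M}_0^a(i)) \,\big\|\, \alpha(\bar{U}^a(i), \cdot\,|\bar{M}_0^a(i))\big) \Bigg].
\end{aligned}
\tag{8.9}
\]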



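To study the lower bound of (8.9) we introduce the measures 

\[
\begin{aligned}
\kappa^a(C_1 \times C_2 \times C_3) &= \frac{1}{r^a} \sum_{i=0}^{r^a-1} \delta_{\bar{U}^a(i)}(C_1)\, \bar{\alpha}_i(\bar{U}^a(i), C_2|\bar{M}_0^a(i))\, \delta_{\bar{M}_0^a(i)}(C_3), \\
\xi^a(C_1 \times C_2) &= \frac{1}{T} \sum_{i=0}^{r^a-1} m_i^a\, \delta_{\bar{U}^a(i)}(C_1)\, \delta_{1/m_i^a}(C_2),
\end{aligned}
\]

where $m_i^a$ is the conditional mean of $\bar{\tau}_i^a$. The restriction on the control measures 
implies 

\[
\sum_{i=0}^{r^a-2} \bar{\tau}_i^a \le T < \sum_{i=0}^{r^a-1} \bar{\tau}_i^a,
\]

and since the function $h$ is increasing on $(1, \infty)$, we can assume without loss of generality 
that $m^a_{r^a-1} \le 1$. 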

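We introduce the function $\ell : \mathbb{R}_+ \to \mathbb{R}_+$ given by $\ell(b) \doteq b \log b - b + 1$. Note that 
$h(b) = b\,\ell(1/b)$. The sequence $\{\kappa^a\}$ is tight because $S$ and $\{0, 1\}$ are compact. Using 
the fact that $\bar{\sigma}_i^a$ selects the conditional distribution of $\bar{\tau}_i^a$ and $\inf\{R(\gamma \| \sigma^1) : \int u\,\gamma(du) 
= b\} = h(b)$, we have 

\[
E\left[ \frac{1}{T} \sum_{i=0}^{r^a-1} R\big(\bar{\sigma}_i^a \,\big\|\, \sigma^1\big) \right] \ge E\left[ \frac{1}{T} \sum_{i=0}^{r^a-1} h(m_i^a) \right] = E\left[ \frac{1}{T} \sum_{i=0}^{r^a-1} m_i^a\, \ell(1/m_i^a) \right] = E\left[ \int \ell(z)\, \xi^a(du \times dz) \right].
\]

Hence the uniform bound on the relative entropy sum gives that $\{\xi^a\}$ is tight. Let $f : 
S \to \mathbb{R}$ be bounded and continuous. Since $\bar{\alpha}_i(\bar{U}^a(i), \cdot\,|\bar{M}_0^a(i))$ selects the conditional 
distribution of $\bar{U}^a(i+1)$, for $\varepsilon > 0$ 

\[
P\left\{ \left| \frac{1}{r^a} \sum_{i=0}^{r^a-1} \left( f(\bar{U}^a(i+1)) - \int f(y)\, \bar{\alpha}_i(\bar{U}^a(i), dy|\bar{M}_0^a(i)) \right) \right| \ge \varepsilon \right\} \le \frac{4 \|f\|_\infty^2}{r^a \varepsilon^2}.
\]

We find that 

\[
[\kappa^\infty]_1 = [\kappa^\infty]_2 \quad \text{w.p.1.} \tag{8.10}
\]

Using Fatou's lemma and the definition of $\varphi$, we get the lower bound $A\, E R([\kappa^\infty]_{1,2} \| 
[\kappa^\infty]_1 \otimes \varphi)$ on the weak limit of the corresponding relative entropies (the second sum 
in (8.9)). 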

With regard to , comparing the form of [^'^] ^ and 4't gives 



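Observe that 

\[
\int_z z\, \xi^a(dx \times dz) = \frac{r^a}{T} [\kappa^a]_1(dx).
\]

Because of the superlinearity of $\ell$ we have uniform integrability, and thus passing to 
the limit gives 

\[
\int_z z\, [\xi^\infty]_1(dx)\, [\xi^\infty]_{2|1}(dz|x) = A [\kappa^\infty]_1(dx).
\]

Hence with the definition $b(x) = \int z\, [\xi^\infty]_{2|1}(dz|x)$, $[d[\kappa^\infty]_1/d[\xi^\infty]_1](x) = b(x)/A$. Using 
Fatou's lemma, Jensen's inequality and the weak convergence, we get the following 
lower bound on the corresponding relative entropies (the first sum in (8.9)): 

\[
\begin{aligned}
\liminf_{T \to \infty} \int \ell(z)\, \xi^a(du \times dz)
&\ge \int \ell(z)\, \xi^\infty(du \times dz) \\
&\ge \int \ell\!\left( \int z\, [\xi^\infty]_{2|1}(dz|x) \right) [\xi^\infty]_1(dx) \\
&= \int \ell(b(x))\, [\xi^\infty]_1(dx) \\
&= A \int h(1/b(x))\, [\kappa^\infty]_1(dx).
\end{aligned}
\]

Let $r(y) = A/b(y)$. Then we can write the combined lower bound on the relative 
entropies as 

\[
\begin{aligned}
& A\, E R([\kappa^\infty]_{1,2} \| [\kappa^\infty]_1 \otimes \varphi) + A\, E \int h(1/b(x))\, [\kappa^\infty]_1(dx) \\
& \quad = A\, E R([\kappa^\infty]_{1,2} \| [\kappa^\infty]_1 \otimes \varphi) - A\, E \int \log r(y)\, [\kappa^\infty]_1(dy) + A \log A - A + 1.
\end{aligned}
\tag{8.11}
\]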
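We next consider $\bar{\psi}_T^a$ and $\bar{\eta}_T^a$. Using the limiting properties of the $\bar{M}_k^a(i)$, we have 

\[
\begin{aligned}
\bar{\eta}_T^a(C) &= \frac{1}{T} \int_0^T \left[ 1_{\{\bar{Z}^a(t)=0\}} \delta_{\bar{Y}^a(t)}(C) + 1_{\{\bar{Z}^a(t)=1\}} \delta_{\bar{Y}^a(t)^R}(C) \right] dt \\
&\to \int \left[ \rho(y) \delta_y(C) + (1 - \rho(y)) \delta_{y^R}(C) \right] [\xi^\infty]_1(dy) \doteq \eta_\infty(C)
\end{aligned}
\]

and 

\[
\bar{\psi}_T^a(C) = \frac{1}{T} \int_0^T \delta_{\bar{Y}^a(t)}(C)\, dt \to \int_C [\xi^\infty]_1(dy) = \psi_\infty(C).
\]

Note that this implies the relation 

\[
\eta_\infty(C) = \int \left[ \rho(y) \delta_y(C) + (1 - \rho(y)) \delta_{y^R}(C) \right] \psi_\infty(dy). \tag{8.12}
\]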



Finally we consider the weighted empirical measure. By (8.11), Lemma 8.2 and 
Jensen's inequality we have the lower bound $[F(E\eta_\infty) + K(E\psi_\infty)]$ for the limit inferior 
of the right hand side of (8.6), where $\eta_\infty$ and $\psi_\infty$ are related by (8.12). Thus 
we need only show that 

\[
K(E\psi_\infty) \ge I^0(E\eta_\infty) = I^\infty(E\eta_\infty).
\]

The equality follows from the definition of $I^\infty$ and (8.12). Let $a(y) \doteq [dE\psi_\infty/d\bar{\mu}](y)$. 
Then 

\[
K(E\psi_\infty) = 1 - \int \sqrt{a(x)a(y)}\, \bar{\mu}(dx)\varphi(x, dy)
\]

and 

\[
I^0(E\eta_\infty) = 1 - \int \frac{1}{2} \sqrt{(a(x) + a(x^R))(a(y) + a(y^R))}\, \mu(dx)\alpha(x, dy).
\]

Using that $\rho(x) = 1 - \rho(x^R)$ and symmetry 

\[
\begin{aligned}
& \int \sqrt{a(x)a(y)}\, \frac{1}{2} \left[ \mu(dx) + \mu(dx^R) \right] \left( \rho(x)\alpha(x, dy) + \rho(x^R)\alpha(x^R, dy^R) \right) \\
& \quad = \frac{1}{2} \int \left( \sqrt{a(x)a(y)} + \sqrt{a(x^R)a(y^R)} \right) \mu(dx) \left( \rho(x)\alpha(x, dy) + \rho(x^R)\alpha(x^R, dy^R) \right) \\
& \quad = \frac{1}{2} \int \left( \sqrt{a(x)a(y)} + \sqrt{a(x^R)a(y^R)} \right) \mu(dx)\alpha(x, dy).
\end{aligned}
\]

We now use 

\[
\sqrt{a(x)a(y)} + \sqrt{a(x^R)a(y^R)} \le \sqrt{(a(x) + a(x^R))(a(y) + a(y^R))}
\]

to obtain $K(E\psi_\infty) \ge I^0(E\eta_\infty)$. [We note for later use that given $\bar{\eta}$ that is absolutely 
continuous with respect to Lebesgue measure and such that $I^\infty(\bar{\eta}) < \infty$, $K(\psi) = 
I^\infty(\bar{\eta})$ can be shown for a $\psi$ that maps to $\bar{\eta}$ by taking $\psi = [\bar{\eta} + \bar{\eta}^R]/2$.] 
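
The pointwise inequality used in the last step is an instance of the Cauchy--Schwarz inequality in $\mathbb{R}^2$. A short verification, with generic nonnegative numbers $p, q, s, t$ introduced here only for this check:

```latex
For nonnegative reals $p, q, s, t$, the Cauchy--Schwarz inequality applied to the
vectors $(\sqrt{p}, \sqrt{s})$ and $(\sqrt{q}, \sqrt{t})$ gives
\[
\sqrt{pq} + \sqrt{st} \le \sqrt{p+s}\,\sqrt{q+t} = \sqrt{(p+s)(q+t)} .
\]
Taking $p = a(x)$, $q = a(y)$, $s = a(x^R)$, $t = a(y^R)$ yields the inequality
used above.
```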

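8.3.4. Combining the cases. In the last subsection we showed that for any 
sequence $r^a$ such that $r^a/T \to A \in [1/C, C]$, 

\[
\begin{aligned}
\liminf_{T \to \infty} -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a) - \infty 1_{\{r^a/T\}^c}(R^a/T)\} \right]
&\ge \left[ F(E\eta_\infty) + I^\infty(E\eta_\infty) \right] \\
&\ge \inf_{\nu \in \mathcal{P}(S)} \left[ F(\nu) + I^\infty(\nu) \right],
\end{aligned}
\]

and an argument by contradiction shows that the bound is uniform in $A$. Thus 

\[
\begin{aligned}
& \liminf_{T \to \infty} -\frac{1}{T} \log \left\{ \sum_{r^a = \lceil T/C \rceil}^{\lfloor TC \rfloor} E\left[ \exp\{-T F(\eta_T^a) - \infty 1_{\{r^a/T\}^c}(R^a/T)\} \right] \right\} \\
& \quad \ge \liminf_{T \to \infty} -\frac{1}{T} \log \left\{ TC \cdot \max_{\lceil T/C \rceil \le r^a \le \lfloor TC \rfloor} E\left[ \exp\{-T F(\eta_T^a) - \infty 1_{\{r^a/T\}^c}(R^a/T)\} \right] \right\} \\
& \quad \ge \inf_{\nu \in \mathcal{P}(S)} \left[ F(\nu) + I^\infty(\nu) \right].
\end{aligned}
\]

We now partition $E[\exp\{-T F(\eta_T^a)\}]$ according to the various cases to obtain the overall 
lower bound 

\[
\liminf_{T \to \infty} -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a)\} \right] \ge \min\left\{ \inf_{\nu \in \mathcal{P}(S)} \left[ F(\nu) + I^\infty(\nu) \right],\ \frac{h(C)}{C},\ (C-1)\, h\!\left( \frac{1}{C-1} \right) \right\}.
\]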



Letting C — >■ oo and using the fact that < 1, we have the desired lower bound 



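\[
\liminf_{T \to \infty} -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a)\} \right] \ge \inf_{\nu \in \mathcal{P}(S)} \left[ F(\nu) + I^\infty(\nu) \right].
\]

8.4. Upper bound. The proof of the reverse inequality 

\[
\limsup_{T \to \infty} -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a)\} \right] \le \inf_{\nu \in \mathcal{P}(S)} \left[ F(\nu) + I^\infty(\nu) \right] \tag{8.13}
\]

is simpler. Let bounded and continuous $F$ be given. Given $\varepsilon > 0$, we can find $\nu$ that 
is absolutely continuous with respect to Lebesgue measure and for which 

\[
\left[ F(\nu) + I^\infty(\nu) \right] \le \inf_{\nu \in \mathcal{P}(S)} \left[ F(\nu) + I^\infty(\nu) \right] + \varepsilon.
\]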

Next we use that $I^\infty$ is convex and $I^\infty(\mu) = 0$ to find $\tau > 0$ such that 

\[
\left[ F(\nu^\tau) + I^\infty(\nu^\tau) \right] \le \left[ F(\nu) + I^\infty(\nu) \right] + \varepsilon
\]

when $\nu^\tau \doteq \tau\mu + (1 - \tau)\nu$. Note that if $\theta(x) \doteq [d\nu/d\mu](x)$ and $\theta^\tau(x) \doteq [d\nu^\tau/d\mu](x)$, 
then e''{x) > T > and so < 1. 

We will construct a control to use in the representation such ryj^ will converge 
w.p. 1 to i^^ and RE'^ will converge w.p.l to These convergences will follow 

from the ergodic theorem, and the bound 9'^{x) > r will be used to argue that the 
ergodicity of the original process is inherited by the controlled process. 

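We now proceed to the construction. Let $\varsigma(dx) \doteq [\nu^\tau(dx) + \nu^\tau(dx^R)]/2$, so that 
$I^\infty(\nu^\tau) = K(\varsigma)$. Let $\kappa(x) = [d\varsigma/d\bar{\mu}](x) \ge \tau/2$. Then 

\[
K(\varsigma) = 1 - \int \sqrt{\kappa(x)\kappa(y)}\, \bar{\mu}(dx)\varphi(x, dy).
\]

Since $\kappa$ is uniformly bounded from below we have the dual relationship 

\[
-\log \int \sqrt{\kappa(x)\kappa(y)}\, \bar{\mu}(dx)\varphi(x, dy) = R(\bar{\eta} \| \bar{\mu} \otimes \varphi) - \frac{1}{2} \int [\log \kappa(x) + \log \kappa(y)]\, \bar{\eta}(dx, dy),
\]

where 

\[
\bar{\eta}(dx, dy) \doteq \frac{e^{\frac{1}{2}[\log \kappa(x) + \log \kappa(y)]}\, \bar{\mu}(dx)\varphi(x, dy)}{\int e^{\frac{1}{2}[\log \kappa(x) + \log \kappa(y)]}\, \bar{\mu}(dx)\varphi(x, dy)}.
\]

Since $\sqrt{\kappa(x)\kappa(y)}$ is symmetric in $x$ and $y$, automatically $[\bar{\eta}]_1 = [\bar{\eta}]_2$, and using the 
bound $\kappa(x) \ge \tau/2$ we can factor $\bar{\eta}(dx \times dy) = [\bar{\eta}]_1(dx)\bar{\varphi}(x, dy)$, where $\bar{\varphi}$ is the 
transition kernel of a uniformly ergodic Markov chain. Let 

\[
\bar{\alpha}(x, dy|0) = \bar{\alpha}(x, dy|1) = \bar{\varphi}(x, dy).
\]

Notice that from the definition of $\bar{\varphi}$, $\bar{\varphi}(x, dy) = \bar{\varphi}(x^R, dy^R)$, hence $\bar{\alpha}$ has the property 
that 

\[
\bar{\alpha}(x^R, dy^R|0) = \bar{\alpha}(x, dy|1).
\]

These will be the transition kernels used to construct the $\bar{U}^a(i)$. 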

Now of course the invariant distribution of $\bar{\varphi}$ is $[\bar{\eta}]_1$, and not the desired 
distribution $\varsigma$. Let $r(x) = [d\varsigma/d[\bar{\eta}]_1](x)$. Then $r(x)$ identifies the way in which the 
distribution of the random variables should be modified so that in the continuous 
time the empirical measure $[\bar{\eta}]_1$ is reshaped into $\varsigma$. Choose $A$ so that equality holds 
in (8.4), and set $b(x) = A r(x)$. Then $\bar{\beta}_i^a$ is chosen to be $\beta^{a b(\bar{U}^a(i))}$. Consistent with the 
analysis of the upper bound, we do not perturb the distribution of the other variables, 
so that $\bar{p}_{i,k}(\cdot, \cdot\,|\cdot) = p(\cdot, \cdot\,|\cdot)$ and $\bar{\sigma}^a_{i,k} = \sigma^a$. We then construct the controlled processes 
using these measures in exact analogy with the construction of the original process. 
With this choice and Lemma 8.1 we obtain the bound 

\[
\begin{aligned}
& -\frac{1}{T} \log E\left[ \exp\{-T F(\eta_T^a)\} \right] \\
& \quad \le E \Bigg[ F(\bar{\eta}_T^a) + \frac{1}{T} \sum_{i=0}^{\bar{R}^a-1} \Big[ R\big(\bar{\alpha}(\bar{U}^a(i), \cdot\,|\bar{M}_0^a(i)) \,\big\|\, \alpha(\bar{U}^a(i), \cdot\,|\bar{M}_0^a(i))\big) + R\big(\beta^{a b(\bar{U}^a(i))} \,\big\|\, \beta^a\big) \Big] \Bigg].
\end{aligned}
\]

Now apply the ergodic theorem, and use that w.p.1 $\bar{R}^a/T \to 1/\int b(x)[\bar{\eta}]_1(dx) 
= A$, and also that asymptotically the conditional distribution of $\bar{M}_0^a(i)$ is given by 
$(\rho_0(\bar{U}^a(i)), \rho_1(\bar{U}^a(i)))$. Then the right hand side of the last display converges to 
the limit 



+ 



1 



Jb{x)[fjWx) 
1 



E R {oi{x, dy\z) \\a{x, dy\z) ) p^_{x) 



logb{x)+b{x) - l)b{x)[r]]iidx) 



r{x)[T]]i{dx) 



JbixMiidx) 

= F{iy^)+A j R{(p{x,dy)\\Lp{x,dy))(,{dx) - A J \ogr{x)c;{dx) + AlogA - A + 1 
= F{v^)+I°°{u^). 

This completes the proof of (8.13) and also the proof of Theorem 4.1. 